OpenAI CEO Sam Altman confirmed that GPT-4's training cost exceeded $100 million, while Google's Gemini Ultra reached an estimated $191 million -- representing a 287,000x increase from the cost of training a Transformer model in 2017 ($670). These costs concentrate frontier AI development in a handful of well-resourced corporations. Academic researchers, as noted by cognitive scientist Sean Trott, find it "hard to run (and even harder to train) state-of-the-art LLMs on an academic budget, which limits the kinds of questions we can ask," creating a structural divide where publicly funded researchers cannot compete with or verify the claims of private AI labs.
Discipline at a Glance
What the evidence shows for Data Scientists & Computational Researchers
Data Scientists & Computational Researchers are represented here through 12 documented evidence items spanning 5 advocacy pillars.
Evidence by Pillar
Each section below draws directly from the niche challenge evidence set for this discipline.
Sustainable Income
3 evidence items
Despite requiring advanced degrees (often PhDs) and expertise spanning biology, statistics, and computer science, bioinformatics professionals face severe pay disparity. The bottom 10% of biological scientists earn just $54,500 annually, while average bioinformatics scientist salaries range from $85,012 to $116,054 depending on the source -- substantially below the $141,000-$250,000 range for ML engineers with comparable technical skills but no domain science requirement. Many bioinformaticians "know they're underpaid but don't know where to start" negotiating, and geographic concentration means those outside San Francisco, Boston, and San Diego earn 20-30% less than national averages.
The ninth annual State of Open Data report from Digital Science, Figshare, and Springer Nature found that while open data is "on the edge of becoming a recognized global standard," critical equity gaps persist. Average repository sharing rates hover around 25% in wealthy nations (US, UK, Germany, France) but remain "significantly below a quarter" in Brazil, Ethiopia, and India. Researchers face unfunded mandates to share data openly while corporations commercialize those same datasets. The report warns of "a potential divide where open science becomes the preserve of better-resourced research environments, potentially marginalizing researchers in low- and middle-income countries."
Well-being
2 evidence items
If you or someone you know is struggling
Immediate support is available now. Call or text 988, text HOME to 741741, or call 1-800-662-HELP (4357).
A 365 Data Science study of 1,001 data scientist LinkedIn profiles across the US (35%), UK (25%), EU (25%), and India (15%) found that data scientists stay with their current employer for an average of just 1.7 years. A related survey of 600 data engineers found that 97% experienced burnout in their day-to-day work, with 79% considering leaving the industry entirely. The primary causes stem from organizational misunderstanding of the role, unrealistic analytics expectations, and the "data janitor" problem -- where professionals hired for advanced analysis spend up to 80% of their time on data cleaning and pipeline maintenance.
Despite headlines about AI growth, the data science job market has bifurcated dramatically. Tech companies laid off approximately 237,000 workers across 1,107 companies in 2024, with data and ML roles hit alongside broader engineering cuts. Companies are "no longer hiring ML engineers to 'figure out our AI strategy'" -- they want specialists who can ship production systems, creating a two-tier market where experienced practitioners command $185,000-$285,000 while entry-level data scientists face a saturated, contracting market. The paradox: organizations simultaneously claim they cannot find enough AI talent while eliminating data science positions in favor of automated ML pipelines and AI-as-a-service solutions.
Discovery & Ranking
3 evidence items
A comprehensive survey found that 90-95% of researchers in the US and UK rely on research software, and more than 63% reported they could not continue their work if such software stopped functioning. Yet Research Software Engineers (RSEs) who build and maintain this critical infrastructure remain largely invisible in academia -- unable to earn authorship credit on papers, excluded from traditional promotion criteria, and lacking formal career paths. The US-RSE community has grown to 2,800 members advocating for recognition, but as a Princeton University study documented, "lack of a clear career path" remains RSEs' top concern, with 194 different job titles fragmenting the profession.
Princeton University's RSE Group documented the systemic absence of career structures for research software engineers in academia. When polled, RSE Group members identified "lack of a clear career path" as their top concern. Rapid expansion of RSE programs combined with retention challenges and limited promotion paths were "amplified by growing demand for RSEs in the private sector, which added risk of turnover." Traditional academic evaluation metrics -- publications, citations, grants -- fail to capture RSE contributions, and universities struggle to align RSE work with existing HR frameworks, forcing these professionals into ill-fitting job classifications that undervalue their expertise.
Source: Designing and Implementing a Comprehensive Research Software Engineer Career Ladder: A Case Study from Princeton University
A white paper submitted to the Heliophysics Decadal Survey documented that Research Software Engineers "receive unequal treatment compared to their science counterparts, including lack of credit for their contributions and insufficient training." Despite being essential to the 63% of US researchers who cannot continue their work without software, RSEs are systematically excluded from authorship, grant PI eligibility, and academic recognition systems. The paper advocates for RSEs to receive "equality of contribution" -- equivalent credit, career advancement opportunities, and institutional support as domain scientists -- arguing that the current system exploits technical labor while reserving prestige and funding for traditional research roles.
Preservation & Portability
2 evidence items
A GigaScience study analyzed 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 biomedical publications. Of 10,388 notebooks with successfully installed dependencies, only 1,203 ran without errors -- and just 879 (approximately 5.9%) produced results identical to the originals. The study found that journals including Nature and Nucleic Acids Research had exception rates well above 50%. Dependency management failures, outdated Python versions, and undeclared dependencies represent systemic barriers to computational reproducibility, threatening the integrity of data-driven scientific research.
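The dependency failures the study describes are largely preventable by recording the exact interpreter and package versions a notebook was run with. A minimal sketch using only the standard library (the function name and package list here are illustrative, not drawn from the study):

```python
# Minimal sketch: pin the Python version and exact package versions a
# notebook depends on, so others can rebuild the same environment.
# The helper name and example packages are illustrative assumptions.
import sys
from importlib import metadata

def pin_environment(packages):
    """Return 'name==version' lines, flagging any undeclared/missing package."""
    lines = [f"# python {sys.version_info.major}.{sys.version_info.minor}"]
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # an undeclared dependency would surface here instead of at runtime
            lines.append(f"# MISSING (undeclared dependency?): {name}")
    return lines

print("\n".join(pin_environment(["pip", "some-undeclared-package"])))
```

Writing the output to a `requirements.txt` committed alongside the notebook addresses two of the three failure modes the study identified: undeclared dependencies and silent version drift.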
A comprehensive review in AI Magazine identified unique obstacles to reproducibility in machine learning research: sensitivity to training conditions, sources of randomness, inherent nondeterminism, and prohibitive computational costs. The reproducibility crisis is quantifiably severe, with one estimate citing "an annual $200 billion global drain on scientific computing resources" from irreproducible computational work. Unlike traditional science where methods can be independently replicated, ML experiments often cannot be verified because the compute costs are too high, the training data is proprietary, or the random seeds and hyperparameters are undocumented -- eroding the scientific foundation of data-driven research.
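Two of the failure modes the review names, undocumented random seeds and undocumented hyperparameters, can be addressed with a small amount of discipline. A hedged sketch of the pattern, using the standard library with a stand-in for the training step (all names are illustrative, not from the review):

```python
# Minimal sketch: fix the seed and log the full configuration next to the
# result, so a rerun with the same record reproduces the same score.
# run_experiment and its fields are illustrative assumptions.
import random

def run_experiment(seed, hyperparams):
    random.seed(seed)  # pin the source of randomness
    # stand-in for training: deterministic given seed + hyperparameters
    score = sum(random.random() for _ in range(hyperparams["steps"]))
    # record everything needed to replicate the run
    return {"seed": seed, "hyperparams": hyperparams, "score": score}

a = run_experiment(42, {"steps": 100, "lr": 0.01})
b = run_experiment(42, {"steps": 100, "lr": 0.01})
assert a["score"] == b["score"]  # same seed + config -> same result
```

This does not solve proprietary data or prohibitive compute costs, but it removes the "undocumented seeds and hyperparameters" barrier at near-zero cost.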
Safety & Harassment
2 evidence items
A Proof News investigation revealed that Nvidia scraped YouTube videos to train its Cosmos AI model, with VP of Research Ming-Yu Liu writing in an internal email about building "a video data factory that can yield a human lifetime visual experience worth of training data per day." Nvidia used dozens of virtual machines with rotating IP addresses to evade YouTube's detection systems. Companies including Meta, Microsoft, and Nvidia extracted over 15.8 million videos from more than 2 million YouTube channels without creator consent, prompting class-action lawsuits alleging unjust enrichment and unfair competition.
MIT Technology Review documented how AI companies have systematically scraped training data without consent or compensation, prompting a wave of lawsuits and licensing deals. The New York Times alleges millions of copyrighted articles were used to train AI models without consent. LinkedIn faces a class-action lawsuit for allegedly harvesting private messages for AI training. Reddit sued Perplexity AI for obtaining data "using false identities, proxies and other antisecurity techniques." Meanwhile, data scientists and researchers whose public datasets, code, and analyses were scraped for training have no mechanism for attribution, opt-out, or compensation -- their computational work treated as raw material for corporate AI products.
Source: AI companies are finally being forced to cough up for training data
If you or someone you know is struggling
These are verified live resources for immediate support. If the evidence on this page feels close to home, use one of them before you keep reading.
988 Suicide & Crisis Lifeline
Free, confidential support available 24/7 in the United States.
Crisis Text Line
Free crisis counseling by text, 24/7.
SAMHSA National Helpline
Free, confidential treatment referral and information service, 24/7, in English and Spanish.
Verified against live destinations on April 13, 2026.
How this discipline connects to the wider crisis
The same discipline-level evidence maps cleanly into the site’s issue pages and public policy framing.
Sustainable Income
Micro-payments, opaque splits, and exploitative contract terms that keep creators from earning a living.
Well-being
Burnout, lack of healthcare, mental health crises, and the human cost of creative gig work.
Discovery & Ranking
Algorithmic gatekeeping, pay-to-play promotion, and monopoly control over who gets seen.
Preservation & Portability
Platform lock-in, format obsolescence, and the risk of losing creative work when services shut down.
Safety & Harassment
Online abuse, content theft, deepfakes, and the failure of platforms to protect creators.
Patterns already visible in the source material
These synthesis themes come directly from the niche challenge sheet for this discipline.
Invisible Labor & Structural Non-Recognition
Research Software Engineers maintain infrastructure that 90-95% of researchers depend on, yet face 194 fragmented job titles, no standard career ladder, and exclusion from authorship and grant eligibility. Data scientists are hired as analysts but spend 80% of their time as "data janitors." Bioinformaticians with PhDs earn less than ML engineers without domain science requirements. Dataset creators receive no attribution when their work is scraped for AI training. Across every sub-type, the computational labor that enables modern research and AI development is systematically rendered invisible by the institutions and corporations that depend on it.
Compute Inequality & Commercialization Without Consent
GPT-4's $100 million training cost and Gemini Ultra's $191 million represent a 287,000x increase since 2017, concentrating frontier AI development in a handful of corporations while academic researchers cannot afford to verify their claims. Nvidia scraped 15.8 million YouTube videos using evasion tactics to train its Cosmos model. Open data mandates push researchers to share freely while corporations harvest that data for proprietary products. The result is a one-directional value extraction pipeline: researchers create, clean, and share data at their own expense; corporations scrape, train, and commercialize it without consent, attribution, or compensation.
Reproducibility Crisis & Burnout Epidemic
Only 5.9% of biomedical Jupyter notebooks produce identical results, and irreproducible computational work drains an estimated $200 billion annually from global scientific computing. ML experiments often cannot be verified due to proprietary data, undocumented parameters, and prohibitive compute costs. Meanwhile, 97% of data engineers report burnout, data scientists average just 1.7 years per job, and 79% of data professionals have considered leaving the field entirely. The people who build the computational foundations of modern science and AI are burning out faster than any institutional response can address, while the work they produce cannot be reliably reproduced or preserved.
Who this evidence already accounts for
These roles and subtypes appear directly in the current discipline sheet.
Data Scientists
Computational Biologists / Data Scientists
Machine Learning Engineers
Included as a documented subtype in the source sheet.
Computational Biologists
Computational Biologists / Data Scientists
Bioinformaticians
Bioinformaticians / Computational Biologists
Research Software Engineers
Research Software Engineers
Algorithm Designers
Algorithm Designers / ML Engineers
Keep exploring the same system from another angle
Stand with creators
The challenges facing data scientists & computational researchers are documented in the evidence above. Sign the declaration to back a better future for creative work.