
Data Scientists & Computational Researchers

A collection of 12 high-quality evidence items documenting systemic challenges facing data scientists and computational researchers. The evidence ranges from the computational reproducibility crisis, where only 5.9% of biomedical Jupyter notebooks produce results identical to the originals, to Nvidia scraping "a human lifetime" of YouTube video per day without creator consent, to GPT-4 training costs exceeding $100 million that lock academic researchers out of frontier AI. This niche operates at the intersection of scientific discovery and invisible labor: research software engineers build and maintain software that 90-95% of researchers depend on yet lack formal career paths, data scientists leave their jobs after an average of just 1.7 years amid burnout and role misunderstanding, and bioinformaticians with PhDs earn less than entry-level software engineers. Meanwhile, open data mandates push researchers to share freely while corporations commercialize their datasets without attribution or compensation.

Discipline at a Glance

12
Evidence Items
Sourced from reporting, studies, and creator testimony
6
Creator Subtypes
Data Scientists, Machine Learning Engineers, Computational Biologists, and more
9
Creator Roles Documented
Unique roles named inside the evidence set
5
Pillars Covered
Out of the 5 STC advocacy pillars

What the evidence shows for Data Scientists & Computational Researchers

Data Scientists & Computational Researchers are represented here through 12 documented evidence items spanning 5 advocacy pillars.

The pattern across all 12 items is consistent. Research Software Engineers maintain infrastructure that 90-95% of researchers depend on, yet lack a standard career ladder and are excluded from authorship and grant eligibility. Data scientists are hired as analysts but spend 80% of their time as "data janitors." Bioinformaticians with PhDs earn less than ML engineers without domain science requirements, and dataset creators receive no attribution when their work is scraped for AI training. Across every subtype, the computational labor that enables modern research and AI development is systematically rendered invisible by the institutions and corporations that depend on it.

Evidence by Pillar

Each section below draws directly from the niche challenge evidence set for this discipline.

Sustainable Income

3 evidence items

View issue page
#3 Compute Cost Inequality & Academic Exclusion · 2025-03 · ML Engineers / Algorithm Designers

OpenAI CEO Sam Altman confirmed that GPT-4's training cost exceeded $100 million, while Google's Gemini Ultra reached an estimated $191 million -- representing a 287,000x increase from the cost of training a Transformer model in 2017 ($670). These costs concentrate frontier AI development in a handful of well-resourced corporations. Academic researchers, as noted by cognitive scientist Sean Trott, find it "hard to run (and even harder to train) state-of-the-art LLMs on an academic budget, which limits the kinds of questions we can ask," creating a structural divide where publicly funded researchers cannot compete with or verify the claims of private AI labs.

$100 million GPT-4 training cost
$191 million Google Gemini Ultra estimated training cost
287,000x increase from 2017 Transformer training cost ($670)
$670 cost to train a Transformer model in 2017
Source: AI Cheat Sheet: Large Language Foundation Model Training Costs
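Taking the cited figures at face value, the cost multiples can be checked directly; a quick back-of-envelope calculation (all dollar amounts are from the source above):

```python
# Back-of-envelope check of the training-cost multiples cited above.
transformer_2017 = 670          # 2017 Transformer training cost (USD)
gpt4 = 100_000_000              # GPT-4 reported training cost (USD)
gemini_ultra = 191_000_000      # Gemini Ultra estimated training cost (USD)

gpt4_multiple = gpt4 / transformer_2017
gemini_multiple = gemini_ultra / transformer_2017

print(f"GPT-4:        {gpt4_multiple:,.0f}x")    # ~149,254x
print(f"Gemini Ultra: {gemini_multiple:,.0f}x")  # ~285,075x
```

So the headline 287,000x multiple tracks the Gemini Ultra estimate (roughly 285,000x with these inputs, matching the source's figure after rounding in its underlying estimates); GPT-4 on its own is closer to a 149,000x increase.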
#6 Compensation Gap & Structural Underpayment · 2024 · Bioinformaticians / Computational Biologists

Despite requiring advanced degrees (often PhDs) and expertise spanning biology, statistics, and computer science, bioinformatics professionals face severe pay disparity. The bottom 10% of biological scientists earn just $54,500 annually, while average bioinformatics scientist salaries range from $85,012 to $116,054 depending on the source -- substantially below the $141,000-$250,000 range for ML engineers with comparable technical skills but no domain science requirement. Many bioinformaticians "know they're underpaid but don't know where to start" negotiating, and geographic concentration means those outside San Francisco, Boston, and San Diego earn 20-30% less than national averages.

$54,500 annual earnings for bottom 10% of biological scientists
$85,012-$116,054 average bioinformatics scientist salary range
$141,000-$250,000 ML engineer salary range for comparison
20-30% pay reduction for bioinformaticians outside major hubs
Source: Bioinformatics Salary Secrets: Why You're Probably Getting Paid Way Less Than You Should
#8 Open Data Mandate Without Creator Compensation · 2024-12 · Data Scientists / Computational Researchers

The ninth annual State of Open Data report from Digital Science, Figshare, and Springer Nature found that while open data is "on the edge of becoming a recognized global standard," critical equity gaps persist. Average repository sharing rates hover around 25% in wealthy nations (US, UK, Germany, France) but remain "significantly below a quarter" in Brazil, Ethiopia, and India. Researchers face unfunded mandates to share data openly while corporations commercialize those same datasets. The report warns of "a potential divide where open science becomes the preserve of better-resourced research environments, potentially marginalizing researchers in low- and middle-income countries."

25% average repository sharing rate in wealthy nations
Source: The state of Open Data 2024: Special report -- Bridging Policy and Practice in Data Sharing

Well-being

2 evidence items

View issue page

If you or someone you know is struggling

Immediate support is available now. Call or text 988, text HOME to 741741, or call 1-800-662-HELP (4357).

#5 Burnout & Unsustainable Tenure · 2021-10 · Data Scientists

A 365 Data Science study of 1,001 data scientist LinkedIn profiles across the US (35%), UK (25%), EU (25%), and India (15%) found that data scientists stay with their current employer for an average of just 1.7 years. A related survey of 600 data engineers found that 97% experienced burnout in their day-to-day work, with 79% considering leaving the industry entirely. The primary causes stem from organizational misunderstanding of the role, unrealistic analytics expectations, and the "data janitor" problem -- where professionals hired for advanced analysis spend up to 80% of their time on data cleaning and pipeline maintenance.

1,001 data scientist LinkedIn profiles studied
1.7 years average data scientist tenure with employer
97% data engineers experiencing burnout
79% data professionals considering leaving the industry
80% time spent on data cleaning rather than advanced analysis
Source: Study Reveals High Turnover Rates Among Data Science Professionals
#11 Job Market Polarization & Role Displacement · 2024-09 · Data Scientists / ML Engineers

Despite headlines about AI growth, the data science job market has bifurcated dramatically. Tech companies laid off approximately 237,000 workers across 1,107 companies in 2024, with data and ML roles hit alongside broader engineering cuts. Companies are "no longer hiring ML engineers to 'figure out our AI strategy'" -- they want specialists who can ship production systems, creating a two-tier market where experienced practitioners command $185,000-$285,000 while entry-level data scientists face a saturated, contracting market. The paradox: organizations simultaneously claim they cannot find enough AI talent while eliminating data science positions in favor of automated ML pipelines and AI-as-a-service solutions.

237,000 tech workers laid off across 1,107 companies in 2024
1,107 companies with layoffs in 2024
$185,000-$285,000 salary range for experienced ML practitioners
Source: Is the AI and Data Job Market Dead?

Discovery & Ranking

3 evidence items

View issue page
#4 Invisible Infrastructure & Recognition Deficit · 2022-05 · Research Software Engineers

A comprehensive survey found that 90-95% of researchers in the US and UK rely on research software, and more than 63% reported they could not continue their work if such software stopped functioning. Yet Research Software Engineers (RSEs) who build and maintain this critical infrastructure remain largely invisible in academia -- unable to earn authorship credit on papers, excluded from traditional promotion criteria, and lacking formal career paths. The US-RSE community has grown to 2,800 members advocating for recognition, but as a Princeton University study documented, "lack of a clear career path" remains RSEs' top concern, with 194 different job titles fragmenting the profession.

90-95% researchers relying on research software
63% researchers who could not continue work without research software
2,800 US-RSE community members
194 different job titles fragmenting the RSE profession
Source: A survey of the state of the practice for research software in the United States
#7 Career Path Absence & Institutional Misclassification · 2025-02 · Research Software Engineers

Princeton University's RSE Group documented the systemic absence of career structures for research software engineers in academia. When polled, RSE Group members identified "lack of a clear career path" as their top concern. Rapid expansion of RSE programs combined with retention challenges and limited promotion paths were "amplified by growing demand for RSEs in the private sector, which added risk of turnover." Traditional academic evaluation metrics -- publications, citations, grants -- fail to capture RSE contributions, and universities struggle to align RSE work with existing HR frameworks, forcing these professionals into ill-fitting job classifications that undervalue their expertise.

Source: Designing and Implementing a Comprehensive Research Software Engineer Career Ladder: A Case Study from Princeton University
#12 Unequal Treatment & Credit Exclusion in Academia · 2024 · Research Software Engineers

A white paper submitted to the Heliophysics Decadal Survey documented that Research Software Engineers "receive unequal treatment compared to their science counterparts, including lack of credit for their contributions and insufficient training." Despite being essential to 63% of US researchers who cannot continue their work without software, RSEs are systematically excluded from authorship, grant PI eligibility, and academic recognition systems. The paper advocates for RSEs to receive "equality of contribution" -- equivalent credit, career advancement opportunities, and institutional support as domain scientists -- arguing that the current system exploits technical labor while reserving prestige and funding for traditional research roles.

63% US researchers who cannot continue work without research software
Source: Advocating for Equality of Contribution: The Research Software Engineer (Heliophysics Decadal Survey 2024)

Preservation & Portability

2 evidence items

View issue page
#1 Computational Reproducibility Crisis · 2024-01 · Computational Biologists / Data Scientists

A GigaScience study analyzed 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 biomedical publications. Of 10,388 notebooks with successfully installed dependencies, only 1,203 ran without errors -- and just 879 (approximately 5.9%) produced results identical to the originals. The study found that journals including Nature and Nucleic Acids Research had exception rates well above 50%. Dependency management failures, outdated Python versions, and undeclared dependencies represent systemic barriers to computational reproducibility, threatening the integrity of data-driven scientific research.

27,271 Jupyter notebooks analyzed
2,660 GitHub repositories studied
3,467 biomedical publications associated
10,388 notebooks with successfully installed dependencies
1,203 notebooks that ran without errors
5.9% notebooks producing results identical to originals
Source: Computational reproducibility of Jupyter notebooks from biomedical publications
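Undeclared dependencies were one of the failure modes the study identified. A minimal sketch of one mitigation -- capturing the exact interpreter and package versions an analysis ran against so a published notebook can declare them. This uses only the standard library; the helper name and lock-file path are illustrative, not from the study:

```python
import sys
from importlib import metadata

def snapshot_environment(path="requirements.lock.txt"):
    """Write the Python version and every installed package,
    pinned to its exact version, to a lock file."""
    lines = [f"# python {sys.version.split()[0]}"]
    lines += sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    )
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return path

# Run at the end of a notebook so the exact environment
# travels with the published analysis.
snapshot_environment()
```

A reader can then recreate the environment with `pip install -r requirements.lock.txt` against the recorded Python version, removing the guesswork the study found behind most notebook failures.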
#10 ML Reproducibility Barriers & Knowledge Loss · 2025-01 · ML Engineers / Algorithm Designers

A comprehensive review in AI Magazine identified unique obstacles to reproducibility in machine learning research: sensitivity to training conditions, sources of randomness, inherent nondeterminism, and prohibitive computational costs. The reproducibility crisis is quantifiably severe, with one estimate citing "an annual $200 billion global drain on scientific computing resources" from irreproducible computational work. Unlike traditional science where methods can be independently replicated, ML experiments often cannot be verified because the compute costs are too high, the training data is proprietary, or the random seeds and hyperparameters are undocumented -- eroding the scientific foundation of data-driven research.

$200 billion estimated annual global drain from irreproducible computational work
Source: Reproducibility in machine-learning-based research: Overview, barriers, and drivers
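Undocumented random seeds are among the barriers named above. A minimal sketch of seed discipline using only the standard library (the function here is a stand-in for a stochastic training step, not code from the review):

```python
import random

def run_experiment(seed: int, n: int = 5) -> list[float]:
    """A stand-in for a stochastic computation: with the seed
    recorded and applied up front, the run is exactly repeatable."""
    rng = random.Random(seed)   # isolated, explicitly seeded generator
    return [rng.random() for _ in range(n)]

# Two runs with the same documented seed agree exactly;
# an undocumented or changed seed produces different results.
assert run_experiment(seed=42) == run_experiment(seed=42)
assert run_experiment(seed=42) != run_experiment(seed=43)
```

Real ML stacks add further nondeterminism (GPU kernels, parallel data loading), but recording every seed alongside hyperparameters is the baseline the review describes as routinely missing.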

Safety & Harassment

2 evidence items

View issue page
#2 Unauthorized Data Scraping for AI Training · 2024-07 · Algorithm Designers / ML Engineers

A Proof News investigation revealed that Nvidia scraped YouTube videos to train its Cosmos AI model, with VP of Research Ming-Yu Liu writing in an internal email about building "a video data factory that can yield a human lifetime visual experience worth of training data per day." Nvidia used dozens of virtual machines with rotating IP addresses to evade YouTube's detection systems. Companies including Meta, Microsoft, and Nvidia extracted over 15.8 million videos from more than 2 million YouTube channels without creator consent, prompting class-action lawsuits alleging unjust enrichment and unfair competition.

15.8 million YouTube videos extracted without consent
2 million YouTube channels scraped
Source: Nvidia Scrapes YouTube, Eyes Netflix, Discovery to Train New Video Model
#9 Training Data Consent & Creator Rights Violation · 2024-07 · Data Scientists / Algorithm Designers

MIT Technology Review documented how AI companies have systematically scraped training data without consent or compensation, prompting a wave of lawsuits and licensing deals. The New York Times alleges millions of copyrighted articles were used to train AI models without consent. LinkedIn faces a class-action lawsuit for allegedly harvesting private messages for AI training. Reddit sued Perplexity AI for obtaining data "using false identities, proxies and other antisecurity techniques." Meanwhile, data scientists and researchers whose public datasets, code, and analyses were scraped for training have no mechanism for attribution, opt-out, or compensation -- their computational work treated as raw material for corporate AI products.

Source: AI companies are finally being forced to cough up for training data

If you or someone you know is struggling

These are verified live resources for immediate support. If the evidence on this page feels close to home, use one of them before you keep reading.

Verified against live destinations on April 13, 2026.

How this discipline connects to the wider crisis

The same discipline-level evidence maps cleanly into the site’s issue pages and public policy framing.

Sustainable Income

Micro-payments, opaque splits, and exploitative contract terms that keep creators from earning a living.

Open issue page

Well-being

Burnout, lack of healthcare, mental health crises, and the human cost of creative gig work.

Open issue page

Discovery & Ranking

Algorithmic gatekeeping, pay-to-play promotion, and monopoly control over who gets seen.

Open issue page

Preservation & Portability

Platform lock-in, format obsolescence, and the risk of losing creative work when services shut down.

Open issue page

Safety & Harassment

Online abuse, content theft, deepfakes, and the failure of platforms to protect creators.

Open issue page

Patterns already visible in the source material

These synthesis themes come directly from the niche challenge sheet for this discipline.

Invisible Labor & Structural Non-Recognition

Research Software Engineers maintain infrastructure that 90-95% of researchers depend on, yet face 194 fragmented job titles, no standard career ladder, and exclusion from authorship and grant eligibility. Data scientists are hired as analysts but spend 80% of their time as "data janitors." Bioinformaticians with PhDs earn less than ML engineers without domain science requirements. Dataset creators receive no attribution when their work is scraped for AI training. Across every sub-type, the computational labor that enables modern research and AI development is systematically rendered invisible by the institutions and corporations that depend on it.

Compute Inequality & Commercialization Without Consent

GPT-4's $100 million training cost and Gemini Ultra's $191 million represent a 287,000x increase since 2017, concentrating frontier AI development in a handful of corporations while academic researchers cannot afford to verify their claims. Nvidia scraped 15.8 million YouTube videos using evasion tactics to train its Cosmos model. Open data mandates push researchers to share freely while corporations harvest that data for proprietary products. The result is a one-directional value extraction pipeline: researchers create, clean, and share data at their own expense; corporations scrape, train, and commercialize it without consent, attribution, or compensation.

Reproducibility Crisis & Burnout Epidemic

Only 5.9% of biomedical Jupyter notebooks produce identical results, and irreproducible computational work drains an estimated $200 billion annually from global scientific computing. ML experiments often cannot be verified due to proprietary data, undocumented parameters, and prohibitive compute costs. Meanwhile, 97% of data engineers report burnout, data scientists average just 1.7 years per job, and 79% of data professionals have considered leaving the field entirely. The people who build the computational foundations of modern science and AI are burning out faster than any institutional response can address, while the work they produce cannot be reliably reproduced or preserved.

Who this evidence already accounts for

These roles and subtypes appear directly in the current discipline sheet.

Data Scientists

Computational Biologists / Data Scientists

Machine Learning Engineers

Included as a documented subtype in the source sheet.

Computational Biologists

Computational Biologists / Data Scientists

Bioinformaticians

Bioinformaticians / Computational Biologists

Research Software Engineers

Research Software Engineers

Algorithm Designers

Algorithm Designers / ML Engineers

Stand with creators

The challenges facing data scientists and computational researchers are documented in the evidence above. Sign the declaration to back a better future for creative work.