OpenAI's LifeSciBench Shows the Best AI Models Still Fail Two Thirds of Real Research Tasks

OpenAI's new 750-task life sciences benchmark, built with 173 scientists, reveals that even its leading model passes just 36 percent of realistic research tasks, and collapses on real scientific artifacts.

Published June 17, 2026

OpenAI Reframes the Life Sciences Question

OpenAI used to sell biology benchmarks the way every lab does, as trivia: can the model name the pathway, recite the mechanism, recall the gene? On June 17 the company published LifeSciBench and quietly retired that framing. The benchmark, built with 173 scientists drawn from biotechnology and pharmaceutical research, contains 750 expert-authored tasks across seven workflows and seven biological domains. It does not ask whether a model knows biology. It asks whether a model can do the work: interpret incomplete evidence, reconcile conflicting results, design an experiment, troubleshoot a failing assay, and then communicate the decision to a research team that will spend real money acting on it.

That distinction matters more than it sounds. OpenAI's own framing is blunt: "Most biology benchmarks ask narrow, fact-based questions with clean answers. Scientists weigh imperfect evidence and make decisions." For the executives funding AI-for-science programs, this is the gap between a demo that dazzles and a deployment that survives contact with a wet lab. LifeSciBench is the first widely published attempt from a frontier lab to measure that survival rate, and the numbers it produces are sobering enough to reset expectations.

The Scores Are Lower Than the Hype

The headline result is that the strongest model, OpenAI's own GPT-Rosalind, passes only 36.1 percent of tasks, with a normalized score of 0.576. GPT-5.5 trails at 25.7 percent and 0.519, Gemini 3.1 Pro at 23.6 percent, GPT-5.4 at 20.7 percent, and Grok 4.3 at just 13.0 percent. Put plainly, the best model on the market fails nearly two-thirds (63.9 percent) of the realistic research tasks a working scientist would hand it, and 22.8 percent of tasks were failed by every model evaluated.

We read these numbers as a useful corrective. The pattern of recent frontier launches has trained buyers to expect benchmark saturation, scores in the high eighties and nineties that make every model look interchangeable. LifeSciBench breaks that pattern precisely because it is hard to game. Grading is unforgiving: as OpenAI puts it, "A response can collect partial credit yet still fail the task." Each task is scored against an expert rubric, 19,020 criteria in total, roughly 25 per task. A model can be mostly right and still be useless to the scientist who has to commit reagents and weeks of bench time to its conclusion.
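To make "partial credit yet still fail" concrete, here is a minimal Python sketch of rubric grading that reports both a normalized score and a pass/fail verdict. The article does not describe LifeSciBench's actual pass rule, so the `Criterion` fields, the `essential` flag, and the 0.7 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    weight: float      # hypothetical: relative importance of this rubric item
    essential: bool    # hypothetical: missing this item fails the task outright
    met: bool          # the grader's judgment for one model response


def normalized_score(criteria: list[Criterion]) -> float:
    """Fraction of total rubric weight earned, between 0 and 1."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total if total else 0.0


def passes(criteria: list[Criterion], threshold: float = 0.7) -> bool:
    """Pass only if every essential criterion is met and the score clears the bar.

    The threshold and the 'essential' rule are assumptions, not the benchmark's
    published grading policy.
    """
    if any(c.essential and not c.met for c in criteria):
        return False
    return normalized_score(criteria) >= threshold


# A made-up rubric for a single task: three of four items are satisfied,
# but the one essential item is missed.
rubric = [
    Criterion("Identifies the confounded control", 2.0, True, False),
    Criterion("Proposes a valid follow-up assay", 1.0, False, True),
    Criterion("Reports the correct effect direction", 1.0, False, True),
    Criterion("States remaining uncertainty", 1.0, False, True),
]

score = normalized_score(rubric)
verdict = passes(rubric)
print(f"normalized score: {score:.2f}, pass: {verdict}")
```

In this toy case the response earns 60 percent of the rubric weight and still fails, the same shape as a leaderboard where normalized scores sit well above pass rates.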

Artifacts Are Where Models Break

The most actionable finding for enterprise buyers is buried in the artifact results. LifeSciBench attaches 1,062 real research artifacts to its tasks: figures, PDFs, data tables, sequence files, protein structures, chemical files, and web references. When models had to reason over those artifacts rather than clean text, performance collapsed. GPT-Rosalind dropped from 45.1 percent on text-only tasks to 28.1 percent on artifact tasks, a 17-point fall. Roughly 79 percent of all tasks require multiple reasoning steps, averaging four steps each, and the artifact-heavy ones are where the chains break down.
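The 17-point figure is an absolute gap in percentage points. A short calculation, using only the two GPT-Rosalind pass rates quoted above, shows what the same drop looks like relative to the text-only baseline.

```python
# GPT-Rosalind pass rates as quoted in the article, in percent.
text_only = 45.1
with_artifacts = 28.1

# Absolute gap, in percentage points.
absolute_drop = text_only - with_artifacts

# The same gap expressed as a share of the text-only pass rate.
relative_drop = absolute_drop / text_only * 100

print(f"absolute drop: {absolute_drop:.1f} points")
print(f"relative drop: {relative_drop:.1f}%")
```

Measured this way, the model loses more than a third of its text-only pass rate once real artifacts enter the task.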

This is the detail that should shape procurement. Real scientific work is not a chat transcript; it is a folder of messy files, half-labeled gels, and a spreadsheet someone exported wrong. A model that scores well on text Q&A but stumbles on a sequence file or a structure render is a model that will disappoint in the lab. The benchmark effectively tells CIOs and chief scientific officers what to test before signing: not the model's knowledge, but its ability to ingest and reason over the actual artifacts their researchers produce every day.

How the Benchmark Was Built

The construction methodology is unusually rigorous, and that rigor is part of the pitch. Tasks were authored by scientists with industry experience and then validated by 453 expert reviewers, predominantly PhDs. Each task went through an average of six revisions plus two rounds of expert review requiring 90 percent agreement before inclusion. The seven workflow categories (evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication) were chosen to mirror the actual arc of a research project rather than isolated capabilities.

That end-to-end view is what separates LifeSciBench from the narrow probes that dominate AI evaluation. OpenAI frames the tasks as testing whether "models can reason from evidence, work with scientific artifacts, handle uncertainty, and make useful decisions under real-world constraints." The scale of the rubric, nearly 20,000 individual scoring criteria, means the benchmark is expensive to produce and hard to overfit. For a field plagued by contaminated test sets and leaked answers, an expert-judged, artifact-grounded benchmark is a credibility asset as much as a measurement tool.

Why This Is a Competitive Move, Not Just Science

It is worth being clear-eyed about why OpenAI is publishing this now. GPT-Rosalind, its life sciences model, tops the leaderboard by a comfortable margin and uses fewer tokens than GPT-5.5 to do it. A company does not build a 750-task benchmark, score its own model first, and publish the results out of pure altruism. LifeSciBench is simultaneously a genuine scientific contribution and a marketing instrument, a way to define the terrain on which OpenAI's vertical model wins. The benchmark's design rewards exactly the capabilities GPT-Rosalind was tuned for.

We do not think that undermines its value, but buyers should hold both ideas at once. A vendor-authored benchmark that the vendor's model leads is informative about capability ceilings and failure modes, and self-serving about rankings. The smart response for enterprises is to treat LifeSciBench as a template, adopt its artifact-grounded, rubric-scored methodology, and re-run it on their own proprietary tasks with their own data. The benchmark's real gift to the industry may be the evaluation pattern, not the leaderboard.
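As a rough illustration of re-running that pattern in-house, the sketch below splits graded results by whether a task carried an artifact and reports a pass rate for each group. The record layout, the field names `has_artifact` and `passed`, and the sample data are all hypothetical; a real evaluation would substitute a team's own tasks and expert-written rubrics.

```python
from collections import defaultdict

# Hypothetical in-house results: one record per graded task.
results = [
    {"task_id": "t1", "has_artifact": False, "passed": True},
    {"task_id": "t2", "has_artifact": False, "passed": False},
    {"task_id": "t3", "has_artifact": True, "passed": False},
    {"task_id": "t4", "has_artifact": True, "passed": False},
    {"task_id": "t5", "has_artifact": True, "passed": True},
    {"task_id": "t6", "has_artifact": False, "passed": True},
]


def pass_rate_by_group(records, key):
    """Return {group_value: pass_rate} for the given record field."""
    counts = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for record in records:
        counts[record[key]][0] += int(record["passed"])
        counts[record[key]][1] += 1
    return {group: passed / total for group, (passed, total) in counts.items()}


rates = pass_rate_by_group(results, "has_artifact")
for has_artifact, rate in sorted(rates.items()):
    label = "artifact tasks" if has_artifact else "text-only tasks"
    print(f"{label}: {rate:.0%} pass rate")
```

Reporting the two groups separately, rather than as one blended number, is what surfaces the kind of text-versus-artifact gap the benchmark highlights.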

What CIOs and Research Leaders Should Take Away

For technology executives steering AI-for-science investments, LifeSciBench delivers three concrete signals. First, calibrate expectations: even the best model fails most realistic research tasks, so AI belongs in an augmentation role with expert review, not autonomous decision-making, in any high-stakes scientific workflow. Second, test on artifacts, because the gap between text performance and artifact performance is the gap between a successful pilot and a failed deployment. Third, demand rubric-based, expert-judged evaluation from any vendor pitching a life sciences model, and be skeptical of saturated scores.

The broader lesson reaches past biology. LifeSciBench is a model for how to evaluate AI in any complex, regulated, artifact-heavy domain (finance, law, engineering, clinical operations) where the work is messy and the cost of a confident wrong answer is high. The labs that win enterprise trust in 2026 will not be the ones with the highest score on a clean benchmark. They will be the ones honest enough to publish how badly current models fail at the real job, and specific enough to show where. OpenAI just set that bar.
