A Stanford Study Finds Law Professors Prefer AI Answers in Three of Four Blind Comparisons

An Uncomfortable Result Lands in Legal Education

A new Stanford study has produced one of the most striking results yet in the debate over generative AI in higher education. When 16 law professors from 14 US law schools sat down to compare anonymized answers to first-year contracts questions, they preferred the AI-written response in just over 75 percent of cases. Across 2,918 blind comparisons, frontier models posted an average win rate of 75.33 percent against answers written by faculty themselves. Google Gemini 2.5 Pro won 75.92 percent of its matchups and NotebookLM won 74.75 percent. For an audience of CTOs and CIOs who fund education technology and reskilling, this is not an abstract academic curiosity. It is a signal that the quality ceiling on AI tutoring in knowledge-heavy domains has moved faster than most procurement teams assumed.

The framing matters as much as the headline number. Law is not a domain where there is usually a single correct answer, which is precisely why the researchers chose it. They were not testing whether AI could grade a multiple-choice quiz. They were testing whether AI could meet the implicit professional standard that lawyers use to judge one another's reasoning. In a field built on argument, nuance, and contested interpretation, the experts repeatedly judged the machine's reasoning to be the stronger of the two. That is a different and harder bar than fact retrieval, and clearing it reframes what enterprise learning leaders should expect from AI in professional development.

What the Researchers Actually Measured

Julian Nyarko of Stanford Law School, a co-author, did not hide his reaction. "We were frankly surprised by the magnitude of the results," he said. "These were not just simple questions with obvious answers." Sarath Sanga of Yale Law School, also a co-author, put the design in plain terms. "In law, there often is not a single right answer," he said. "What we wanted to know is whether AI can meet the latent professional standard that lawyers use to evaluate each other's arguments. In this case, the answer was yes." The professors did not know which answers were human and which were machine, which removes the brand bias that taints many casual comparisons of this kind.

The harm data is the part enterprise risk teams should read twice. AI responses were flagged as potentially harmful in 3.53 percent of comparisons. Instructor-written answers were flagged at 12.06 percent, more than three times the rate. That inverts the usual assumption that human oversight is the safety layer and the model is the liability. It does not mean professors are careless. It more likely reflects that the models were tuned to hedge, qualify, and avoid overreach in ways that read as safer to expert evaluators. For organizations weighing AI in compliance training, legal operations, or regulated client guidance, the lower harm rate is at least as interesting as the higher preference rate.
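The comparison between the two flag rates can be checked with a few lines of arithmetic. This is only a sketch using the two percentages quoted above; the variable names are illustrative, and the underlying per-comparison counts are not given here, so nothing beyond the ratio and the gap is computed.

```python
# Harm-flag rates quoted in the article, in percent of comparisons.
ai_harm_rate = 3.53
instructor_harm_rate = 12.06

# How many times more often instructor answers were flagged.
ratio = instructor_harm_rate / ai_harm_rate

# Absolute gap between the two rates, in percentage points.
gap_points = instructor_harm_rate - ai_harm_rate

print(f"Instructor answers were flagged {ratio:.2f}x as often as AI answers")
print(f"Absolute gap: {gap_points:.2f} percentage points")
```

The ratio comes out a little above 3.4, which is consistent with the "more than three times" description.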

Read the Limits Before You Read the Hype

The authors were careful to fence in their own finding, and buyers should respect those fences. The study covered brief, written responses in first-year contracts courses. It did not measure longer conversations, multi-turn coaching, student retention, or the downstream effect on critical thinking. A model that writes a better single answer is not the same as a model that teaches a student to reason. Nyarko said as much. "Our study evaluates the quality of answers given by AI tools," he noted. "But how to implement these tools to most effectively improve student learning is still an open question." That gap between answer quality and learning outcomes is where most edtech deployments succeed or fail.

This is the trap that has swallowed earlier waves of education technology. A tool that demos beautifully in a controlled comparison can still degrade learning if students offload the cognitive work rather than engage with it. The risk is not that the AI is wrong. The risk is that it is so fluent and so often right that learners stop building the muscles the course exists to develop. Procurement that stops at the benchmark and skips the pedagogy will buy a very good answer engine and call it a tutor. Those are not the same product, and the bill for confusing them arrives later.

The Equity Question Hiding in the Data

Stanford's result does not stand alone. It lands in the same week that a separate Google DeepMind trial in Sierra Leone reported strong math gains from AI-guided learning, while flagging that stronger students benefited most and weaker ones risked being left behind. Put the two findings together and a pattern emerges. AI can raise the quality of expert-level reasoning and raise measured outcomes, yet it can also widen gaps between learners who already have the skills to interrogate it and those who do not. For institutions and employers chasing equitable upskilling, the headline win rate is the easy part. Distributing the benefit fairly is the hard part.

This is why the implementation question is not a footnote. A high win rate tells you the raw material is good. It tells you nothing about whether your weakest cohort will use it well. Enterprise learning leaders who have watched adaptive-learning pilots stall on exactly this problem should treat the Stanford number as an invitation to design, not a license to deploy. The technology clears the quality bar. Whether it clears the fairness bar depends entirely on the guardrails wrapped around it.

How to Deploy Without Hollowing Out the Skill

The researchers did not just publish a provocative number. They published a deployment recipe, and it is a sober one. They recommend course-based randomized controlled trials before any broad rollout, rather than treating a benchmark as proof. They argue that any classroom or workplace implementation should include clear limits on scope, citations back to course or source materials, refusal mechanisms when the model is uncertain, and explicit escalation routes to a human instructor. That is a governance pattern, not a feature list, and it maps cleanly onto how a serious enterprise would govern any AI system touching regulated work.
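The four guardrails named above (scope limits, citations, refusal under uncertainty, escalation to a human) can be pictured as a small routing step that sits between the model and the learner. The following is a minimal sketch under stated assumptions, not the researchers' implementation: the `DraftAnswer` fields, the `ALLOWED_TOPICS` set, the `MIN_CONFIDENCE` threshold, and the choice of which failure leads to refusal versus escalation are all hypothetical, chosen only to illustrate the pattern.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ANSWER = "answer"      # show the model's answer to the learner
    REFUSE = "refuse"      # decline rather than risk a confident-but-wrong reply
    ESCALATE = "escalate"  # hand the question to a human instructor


@dataclass
class DraftAnswer:
    topic: str                                          # hypothetical topic label
    text: str
    citations: list[str] = field(default_factory=list)  # pointers to course materials
    confidence: float = 0.0                             # hypothetical score in [0, 1]


# Illustrative policy values, not taken from the study.
ALLOWED_TOPICS = {"contracts"}
MIN_CONFIDENCE = 0.7


def route(draft: DraftAnswer) -> Action:
    """Apply the guardrails in order and return what to do with a draft."""
    # Scope limit: questions outside the course go to a human.
    if draft.topic not in ALLOWED_TOPICS:
        return Action.ESCALATE
    # Refusal mechanism: an uncertain model declines to answer.
    if draft.confidence < MIN_CONFIDENCE:
        return Action.REFUSE
    # Citation requirement: an unsourced answer is not shown as-is.
    if not draft.citations:
        return Action.ESCALATE
    return Action.ANSWER


in_scope = DraftAnswer("contracts", "Draft text.", ["Course reader, unit 2"], 0.9)
print(route(in_scope).value)
```

Keeping the policy in one explicit function is a design choice made here for auditability: each rule maps to one named guardrail, and the thresholds can be tuned during a trial rather than buried in a prompt.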

For technology executives, the translation is direct. Citations make outputs auditable. Refusal mechanisms reduce the confident-but-wrong failures that erode trust. Escalation routes keep a human accountable for the consequential calls. Randomized trials replace anecdote with evidence before money and reputation are committed. Strip the academic framing and this is simply good AI governance applied to learning. The organizations that win with AI tutoring will be the ones that treat a 75 percent preference rate as the start of a controlled rollout, not the end of the evaluation.

What CIOs and Learning Leaders Should Take Away

The durable lesson is not that AI beat law professors. It is that in a domain defined by judgment rather than facts, expert evaluators preferred the machine three times out of four and flagged it as harmful far less often than they flagged themselves. That combination should reset expectations for what AI can contribute to professional education, legal operations, and high-stakes knowledge work. The capability is no longer the bottleneck. The bottleneck is design, governance, and the discipline to measure learning rather than admire answers.

We would caution against both reflexive dismissal and uncritical adoption. The professors who ran this study are themselves urging restraint and randomized trials before scaling, which is the right instinct. Treat the result as a credible upper bound on answer quality and a clear lower bound on the governance work required to capture it safely. The institutions and enterprises that internalize that distinction will turn a striking benchmark into durable advantage. The ones that stop at the headline will buy a very persuasive tool and learn the hard way that persuasion is not the same as pedagogy.
