Evidence Where There Has Been Mostly Hype
Google DeepMind and Fab AI released results from a preregistered randomized controlled trial showing that students who used Gemini Guided Learning during teacher-led math lessons scored significantly higher than peers receiving standard instruction. The study involved 1,763 Grade 7 and Grade 8 students across 48 classrooms in 12 government-supported junior secondary schools in Port Loko District, Sierra Leone. After years of breathless claims about AI transforming education, here is something the field has badly needed: rigorous, preregistered evidence.
We cannot overstate how unusual this methodological seriousness is in edtech. The sector is awash in vendor case studies, self-selected success stories, and pilots designed to flatter the product. A preregistered randomized controlled trial, in which the hypotheses and analysis plan are committed in advance to prevent cherry-picking, is the gold standard of evidence, and it is almost never applied to AI education tools. That DeepMind chose this rigor, and published the results, sets a standard the rest of the industry should be measured against. The method matters as much as the finding.
What the Numbers Show
The results are encouraging and honestly reported. The intent-to-treat effect was 0.258 standard deviations, with a treatment-on-treated effect of 0.380 standard deviations among classrooms that completed the full dosage. For context, effect sizes in education research are notoriously modest, and an intent-to-treat effect above 0.25 standard deviations from a classroom intervention is genuinely meaningful. The larger effect among classrooms that used the tool as intended is exactly the pattern a credible result should show.
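As a rough illustration of how the two figures relate, suppose the treatment-on-treated estimate came from the standard no-show adjustment (Bloom's estimator), which divides the intent-to-treat effect by the share of the treatment group that actually complied. The study's estimation method is not described here, so that assumption, along with the function name below, is ours and not the researchers'. Under it, the ratio of the two reported numbers implies a compliance rate:

```python
def implied_compliance(itt_effect: float, tot_effect: float) -> float:
    """Rearrange the Bloom adjustment, TOT = ITT / c, to recover c = ITT / TOT.

    Assumes one-sided noncompliance (no control classroom used the tool).
    This is an illustrative assumption, not the study's documented method.
    """
    if tot_effect == 0:
        raise ValueError("TOT effect must be non-zero")
    return itt_effect / tot_effect


# Effect sizes as reported, in standard deviations.
ITT_EFFECT_SD = 0.258
TOT_EFFECT_SD = 0.380

compliance = implied_compliance(ITT_EFFECT_SD, TOT_EFFECT_SD)
print(f"Implied full-dosage share: {compliance:.1%}")  # Implied full-dosage share: 67.9%
```

If the assumption holds, roughly two thirds of treatment classrooms completed the full dosage. The true share may differ if the researchers used a different estimator.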
The intervention design was disciplined. It ran from October to December 2025, with teachers asked to use Guided Learning in two of four weekly math periods, targeting roughly 12 hours over eight weeks. We appreciate that the researchers reported both the intent-to-treat and treatment-on-treated figures rather than only the more flattering number. The gap between them is informative in itself: it tells us the tool works when actually used, and it quietly highlights that implementation fidelity, getting teachers to use the intervention as designed, is where much of the real-world variation in outcomes will come from.
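The dosage figures can be checked with simple arithmetic. The sketch below assumes every scheduled period was held and that the 12 target hours were spread evenly across the AI-supported periods. Neither assumption is confirmed in the reported results, and the variable names are ours.

```python
# Dosage parameters as reported for the intervention.
WEEKS = 8
AI_PERIODS_PER_WEEK = 2
MATH_PERIODS_PER_WEEK = 4
TARGET_HOURS = 12

# Assumption: no cancelled periods, and time split evenly across sessions.
ai_sessions = WEEKS * AI_PERIODS_PER_WEEK
minutes_per_session = TARGET_HOURS * 60 / ai_sessions
share_of_math_periods = AI_PERIODS_PER_WEEK / MATH_PERIODS_PER_WEEK

print(f"AI-supported sessions: {ai_sessions}")                # 16
print(f"Minutes per session: {minutes_per_session:.0f}")      # 45
print(f"Share of math periods: {share_of_math_periods:.0%}")  # 50%
```

On these assumptions the design amounts to 16 sessions of about 45 minutes each, covering half of scheduled math time. A classroom that missed even a few sessions would fall noticeably short of the target.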
Guided Learning, Not Answer Vending
The most reassuring findings concern how students actually used the AI, and they directly counter the deepest fear about AI in education. Analysis of 113,344 messages across 7,421 conversations found that 97.4 percent of student messages were on-topic and only 2.1 percent sought direct solutions, while 76.4 percent of AI responses were scaffolding questions. In other words, students used the tool to learn rather than to cheat, and the AI guided rather than simply handed over answers.
This is the crux of the matter. The pervasive worry about AI in classrooms is that it becomes a sophisticated cheating machine, short-circuiting the productive struggle that genuine learning requires. The data here suggest that a tool deliberately designed for guided learning, one that asks scaffolding questions instead of dispensing solutions, can avoid that trap. The design intent was realized in practice: the AI predominantly responded with questions that pushed students to think, and students overwhelmingly engaged with the material rather than gaming the system. The pedagogy was built into the product, and it held up under real classroom conditions.
A Sociotechnical Intervention, Not a Gadget Drop
DeepMind's research director, Irina Jurenka, was careful to frame the result correctly: "This wasn't just about dropping the AI into the classrooms, it was a sociotechnical intervention." That distinction is essential and too often ignored. All teachers received five to six hours of training, and the partners included Oxford MeasurEd, Laterite, EducAid, and the Sierra Leone Ministry of Basic and Senior Secondary Education. The success was not the AI alone; it was the AI embedded in trained teaching, institutional support, and careful implementation.
We want to underline this because it is the lesson most likely to be lost in translation. Technology dropped into classrooms without teacher training, institutional buy-in, and thoughtful integration routinely fails, and the history of education technology is a graveyard of gadgets that promised transformation and delivered nothing. The Sierra Leone result worked because it treated the human and institutional context as integral to the intervention. One participating teacher offered the most human evidence of all: "the introduction of AI, I mean, let me confess, I've seen children rushing to attend classes." Engagement, properly cultivated, is itself a meaningful outcome.
The Start of a Real Evidence Base
DeepMind describes this as the first in a planned global portfolio of preregistered trials on AI's effect on teaching and learning, and that commitment may matter more than this single study. One trial in one district, however rigorous, cannot answer whether AI helps students learn across the wildly varied contexts of global education. A sustained program of preregistered trials across different settings, subjects, and age groups is what could actually build a credible evidence base where almost none exists today.
We would hold this work to its own high standard going forward. The Sierra Leone result is genuinely promising, and the methodological seriousness is exactly what the field has lacked. But the honest conclusion is that this is the beginning of an evidence base, not a verdict. The questions that remain, whether the effects persist, whether they replicate in other contexts, whether they hold at scale without intensive support, are precisely the ones the planned portfolio must answer. For now, the field has something rare and valuable: rigorous evidence that, under the right conditions, AI can help children learn. That is worth celebrating, carefully.



