NVIDIA Blackwell Sweeps MLPerf Training 6.0 as Azure Trains Llama 405B in Seven Minutes

Benchmarks that redraw the frontier

The latest round of MLPerf Training results, version 6.0, offers a precise snapshot of how fast the frontier of AI training infrastructure is moving, and the numbers are difficult to internalize. NVIDIA's Blackwell platform delivered the fastest training times across all seven benchmarks in the suite and was the only platform to submit results in every category. The headline figure belongs to Microsoft Azure, which reached the quality target for Llama 3.1 with 405 billion parameters in 7.07 minutes, using 8,192 GPUs spread across 128 racks. Training a model of that size to a target quality in roughly seven minutes would have seemed implausible only a couple of years ago.

These benchmarks matter beyond bragging rights because they define what is economically feasible. The time and cost to train a model determine how often an organization can iterate, how many experiments it can run, and how quickly it can bring a new model to market. When training times collapse, the entire rhythm of AI development accelerates. A team that can train a frontier-scale model in minutes rather than days can explore far more of the design space, and that capacity to iterate is increasingly the real competitive advantage in frontier AI, more than any single architectural insight.
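To make the economics concrete, the short Python sketch below converts the Azure headline figure (8,192 GPUs for 7.07 minutes) into GPU-hours and then into a rough dollar cost. The price per GPU-hour is a hypothetical placeholder, not a figure from the MLPerf results or from any provider's price list, and the function names are illustrative only.

```python
def gpu_hours(num_gpus: int, minutes: float) -> float:
    """Total accelerator time consumed by one run, in GPU-hours."""
    return num_gpus * minutes / 60.0


def run_cost(num_gpus: int, minutes: float, price_per_gpu_hour: float) -> float:
    """Rough cost of one run at a flat hourly price per GPU."""
    return gpu_hours(num_gpus, minutes) * price_per_gpu_hour


# Figures reported for the Azure Llama 3.1 405B submission.
AZURE_GPUS = 8192
AZURE_MINUTES = 7.07

# ASSUMPTION: placeholder price, chosen only to show the arithmetic.
HYPOTHETICAL_PRICE_PER_GPU_HOUR = 3.00

azure_gpu_hours = gpu_hours(AZURE_GPUS, AZURE_MINUTES)
azure_cost = run_cost(AZURE_GPUS, AZURE_MINUTES, HYPOTHETICAL_PRICE_PER_GPU_HOUR)

print(f"GPU-hours per run: {azure_gpu_hours:,.1f}")
print(f"Cost per run at the assumed price: ${azure_cost:,.2f}")
```

The run works out to roughly 965 GPU-hours. Swapping in a real negotiated price turns the same two functions into a quick way to compare how many benchmark-sized runs a fixed budget would cover.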

Scale and efficiency together

The results showcase two distinct kinds of progress that often trade off against each other but here advanced together. The first is raw scale. The largest Blackwell cluster submission used 8,192 GPUs working in concert, a significant engineering achievement in its own right, because keeping that many processors synchronized and fed with data without bottlenecks is extraordinarily hard. The second is per-chip efficiency. The GB300 NVL72 system delivered up to 1.6 times the performance of the prior-generation GB200 NVL72, meaning each unit of hardware did substantially more work than before.

Achieving both at once is the difficult part. It is comparatively easy to go faster by simply adding more hardware, but that path runs into power, cost, and coordination limits. It is also possible to improve efficiency on small workloads that do not test the limits of coordination. Delivering more performance per chip while simultaneously scaling to thousands of chips is the combination that actually moves the economics of training. CoreWeave's result drives the point home: it trained the 671-billion-parameter DeepSeek-V3 model to target quality in 2.02 minutes on the GB300 NVL72, demonstrating both scale and efficiency in a single run.

New workloads reflect a changing field

MLPerf Training 6.0 also updated what it measures, and the changes are revealing. The suite added two new mixture-of-experts workloads, DeepSeek-V3 with 671 billion parameters and GPT-OSS-20B. Mixture-of-experts architectures, which activate only a subset of a model's parameters for any given input, have become central to how the most capable models are built, because they allow enormous total parameter counts without a proportional increase in computation per token. That the benchmark added these workloads reflects their dominance in current frontier model design.
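The idea of activating only a subset of parameters per input can be illustrated with a toy top-k router in plain Python. This is a generic sketch of the technique, not the routing scheme of DeepSeek-V3 or GPT-OSS-20B; the expert count, the value of k, the random gate scores, and the scalar "experts" are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, gate_scores, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, routes


# ASSUMPTION: toy sizes; real models use far more experts and learned gates.
NUM_EXPERTS = 8
TOP_K = 2

# Each toy "expert" just scales its input by a different constant.
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

rng = random.Random(0)
gate_scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

output, routes = moe_forward(1.0, experts, gate_scores, TOP_K)
active_fraction = len(routes) / NUM_EXPERTS

print("experts used:", [i for i, _ in routes])
print("fraction of experts active:", active_fraction)
```

With 2 of 8 experts selected, only a quarter of the expert parameters do any work for this input, which is the property that lets total parameter count grow without a matching growth in compute per token.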

The technology underneath the results points to where infrastructure is heading. NVIDIA highlighted NVFP4, a low-precision training format that reduces the computational cost of each operation, fifth-generation NVLink switches that unify 72 GPUs into a tightly coupled unit, and NVRx, a resiliency extension that handles checkpoint-based recovery when hardware fails at scale. That last item is easy to overlook but increasingly essential. When training runs span thousands of GPUs for extended periods, individual component failures are not rare events but statistical certainties, and the ability to recover gracefully rather than restart from scratch is what makes massive-scale training practical at all.
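The "statistical certainty" claim follows from simple probability. If each GPU fails independently with probability p over some window, the chance that at least one of N GPUs fails is 1 - (1 - p)^N. The sketch below evaluates that formula in Python; the per-GPU failure probability is a hypothetical value chosen for illustration, and real failures are neither perfectly independent nor limited to GPUs.

```python
def prob_any_failure(num_gpus: int, per_gpu_failure_prob: float) -> float:
    """Probability that at least one of num_gpus independent devices fails."""
    return 1.0 - (1.0 - per_gpu_failure_prob) ** num_gpus


# ASSUMPTION: 0.1% chance that a given GPU fails during the window of interest.
HYPOTHETICAL_FAILURE_PROB = 0.001

for cluster_size in (8, 72, 8192):
    p_any = prob_any_failure(cluster_size, HYPOTHETICAL_FAILURE_PROB)
    print(f"{cluster_size:>5} GPUs: P(at least one failure) = {p_any:.4f}")
```

Under this assumed rate, an 8-GPU server sees a failure in under 1 percent of windows, while an 8,192-GPU cluster sees at least one in more than 99.9 percent of them.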

What it means for everyone else

It is fair to ask what these frontier numbers mean for the vast majority of organizations that will never train a 405-billion-parameter model from scratch. The answer is that the frontier sets the trajectory for everything below it. The efficiency gains demonstrated at the top of the market propagate downward, lowering the cost of training and fine-tuning the smaller, specialized models that most enterprises actually deploy. The hardware and software advances that let Azure train Llama 405B in seven minutes are the same advances that make it cheaper for an enterprise to fine-tune a domain-specific model next year.

There is also a strategic signal in who posted these results. Microsoft Azure and CoreWeave, a cloud provider and an AI-focused neocloud, are the ones demonstrating frontier-scale training, not just NVIDIA in isolation. This underscores that the ability to train at the frontier is increasingly a cloud capability, accessible by renting rather than building. For most enterprises, that is the relevant lesson. They will consume this performance through cloud platforms rather than owning the hardware, and the competition among providers to offer the fastest, most cost-effective training will keep pushing the price of AI development down.

The reliability story underneath

The most underappreciated theme in these results is reliability at scale. Training across thousands of GPUs for any meaningful duration guarantees that some hardware will fail mid-run. Without mechanisms to handle that gracefully, a single component failure can waste enormous amounts of compute by forcing a restart. NVIDIA's emphasis on resiliency through NVRx and checkpoint-based recovery reflects a maturing understanding that frontier training is as much a reliability engineering problem as a raw performance one. The fastest hardware in the world is worthless if a run cannot survive the inevitable failures along the way.
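The cost of a forced restart, and what periodic checkpoints buy back, can be shown with a small deterministic simulation in Python. This is a toy model, not a description of how NVRx works: the step counts, failure points, and checkpoint interval are all assumptions, and checkpoint save and load time is ignored.

```python
def total_steps_executed(target_steps, failure_points, checkpoint_every=None):
    """Count the steps a run executes before finishing target_steps.

    A failure fires once when progress first reaches each failure point.
    On failure, progress rolls back to the last checkpoint, or to step 0
    when checkpointing is disabled.
    """
    pending = sorted(failure_points)
    progress = 0
    last_checkpoint = 0
    executed = 0
    while progress < target_steps:
        progress += 1
        executed += 1
        if checkpoint_every and progress % checkpoint_every == 0:
            last_checkpoint = progress
        if pending and progress == pending[0]:
            pending.pop(0)
            progress = last_checkpoint  # redo everything since the checkpoint
    return executed


# ASSUMPTION: a 1,000-step run hit by two failures, at steps 350 and 920.
TARGET_STEPS = 1000
FAILURE_POINTS = [350, 920]

without_checkpoints = total_steps_executed(TARGET_STEPS, FAILURE_POINTS)
with_checkpoints = total_steps_executed(
    TARGET_STEPS, FAILURE_POINTS, checkpoint_every=100
)

print("steps executed, restart from scratch:", without_checkpoints)
print("steps executed, checkpoint every 100:", with_checkpoints)
```

In this scenario the restart-from-scratch run executes 2,270 steps to complete 1,000, while the checkpointed run executes 1,070, so the wasted work drops from 1,270 steps to 70. A real system would also weigh the time spent writing checkpoints against the work they save.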

This is the kind of unglamorous capability that separates demonstrations from production. As organizations scale their AI training ambitions, they will discover that coordination, fault tolerance, and recovery matter as much as peak throughput. NVIDIA summarized its own pitch in terms that capture the integrated nature of the challenge, describing a single platform engineered through extreme codesign to enable AI model builders to launch frontier models faster while minimizing training costs. Whether one accepts the marketing framing or not, the underlying point holds. At this scale, performance, efficiency, and reliability are not separate concerns but a single engineering problem, and the MLPerf 6.0 results show how far the integrated solution has come.

A measuring stick worth watching

MLPerf has earned its influence by being a standardized, audited benchmark in a field awash with selective and unverifiable performance claims. When a vendor submits results to MLPerf, those results are produced under common rules and open to scrutiny, which makes them far more credible than the cherry-picked figures that fill product announcements. For technology leaders trying to cut through AI infrastructure marketing, MLPerf remains one of the few reliable measuring sticks, and its periodic updates are a useful pulse check on the real, verifiable state of the art.

The trajectory these results trace is unmistakable. Training that once took days now takes minutes, efficiency and scale are advancing together, and the capability is increasingly delivered through cloud providers rather than confined to those who build their own infrastructure. For enterprises, the practical implication is that the cost of developing and customizing AI models will keep falling, expanding what is economically feasible. The frontier demonstrated in MLPerf 6.0 is not where most organizations operate, but it is the leading edge of a curve that everyone else is riding, and it is bending downward in cost as fast as it is bending upward in capability.
