How Do You Measure Developer Productivity in an AI-Augmented World?

The news keeps getting worse

Meta is tracking developer AI tool usage. So is Spotify, allegedly. Over the last few weeks the stories have piled up: agent runs, session counts, raw tokens burnt through Copilot and Cursor and Claude Code. The framing in every boardroom goes the same way. “AI is expensive. We should know who is getting value out of it.” Token consumption becomes the proxy. And off we go.

Here is what I keep coming back to. If the biggest, best-funded engineering orgs on the planet are still fumbling this question in public, there is no playbook to copy. Anyone selling you a “best practice” for measuring AI-augmented productivity this quarter is guessing. I am guessing too. I am just trying to guess more honestly.

Tokens are an input. Engineering is an output business.

[Image: A crowded 80s sales bullpen with a giant neon scoreboard on the wall reading CALLS, while one desk sits empty in the foreground, a visual metaphor for rewarding activity over outcomes.]

The core problem with token tracking is simple: it measures an input, and engineering has always been an output business.

Imagine a sales team graded on calls dialled instead of deals closed. You get a phone bank that never sells. Imagine a gym member grading themselves on hours spent on the treadmill instead of body composition. You get tired, not fit. Token maxxing is the same failure mode, aimed at engineering, and every generation of engineering metrics has hit this wall.

Lines of code rewarded verbosity and punished refactors that deleted files.
Commits per day rewarded mashing the enter key and punished anyone who batched work cleanly.
PR count rewarded fragmenting work and punished anyone who shipped large coherent changes.
Tokens consumed rewards prompt padding, verbose completions, and asking the AI for things you already know.

What matters is what shipped, whether it stayed up, and whether customers noticed. That has always been true. AI does not change the math. It just makes it more tempting to measure the wrong thing, because the wrong thing is finally easy to count.

DORA is still the right starting point

[Image: A small engineering team gathered around a large monitor in a bright modern office, looking at a clean production dashboard showing DORA-style panels: deployment frequency, lead time, change failure rate, and an MTTR timeline.]

Here is the good news. We already have a measurement framework that focuses on the right thing, and it has roughly a decade of research behind it. DORA, short for DevOps Research and Assessment.

Four metrics, no more:

Deployment frequency. How often does this team ship to production?
Lead time for changes. From commit to production, how long?
Change failure rate. What percentage of deployments cause an incident, rollback, or hotfix?
Mean time to recovery (MTTR). When something breaks, how fast do you get it back?

Two of these (frequency and lead time) measure throughput. Two (failure rate and MTTR) measure quality and resilience. A team that is genuinely getting better will move all four in the right direction at once. A team that is gaming one of them will move that one and break the others.
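To make the four definitions concrete, here is a minimal sketch of how they could be computed from a log of deployments. The `Deployment` record, its field names, and the choice to measure recovery from the moment of the failed deploy are all assumptions for illustration; your pipeline and incident tooling will record these differently.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; adapt to whatever your deploy pipeline logs.
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # caused an incident, rollback, or hotfix
    recovered_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Summarise the four DORA metrics over one window of deployments."""
    if not deploys:
        raise ValueError("need at least one deployment in the window")
    failures = [d for d in deploys if d.failed]
    # Assumption: recovery time runs from the failed deploy to restoration.
    recoveries = [hours(d.recovered_at - d.deployed_at)
                  for d in failures if d.recovered_at is not None]
    return {
        "deploys_per_week": len(deploys) / (window_days / 7),
        "median_lead_time_hours": median(hours(d.deployed_at - d.committed_at)
                                         for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        # NaN rather than 0 when nothing failed, so "no data" is not read as "instant".
        "mttr_hours": mean(recoveries) if recoveries else math.nan,
    }


# Made-up week of four deployments, one of which failed and took 2 hours to fix.
t0 = datetime(2025, 1, 6, 9, 0)
week = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=8)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6),
               failed=True, recovered_at=t0 + timedelta(days=2, hours=8)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=2)),
]
metrics = dora_metrics(week, window_days=7)
print(metrics)
```

The median is used for lead time so that one slow change does not dominate the number; that is a design choice of this sketch, not something the framework mandates.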

In an AI-augmented world, DORA becomes more important, not less. Taken as a set of four, it is the only framework I know that cannot be gamed by generating more text.

How AI changes the DORA picture

The honest question is whether AI actually helps DORA move. Here is what I would expect to see if it does:

Deploy frequency should go up. If AI is genuinely unblocking people, they ship more often.
Lead time should shrink. Commit to production gets shorter when boilerplate and review friction drop.
Change failure rate is the canary. This is where AI slop shows up first. Generated code that compiles but misunderstands the domain lands in production and breaks something subtle. Watch this number like a hawk after you roll out any coding agent.
MTTR is the comprehension test. When an incident fires at 3am, can the on-call actually fix what shipped, or do they have to ask the AI what the AI wrote last Tuesday? Rising MTTR on AI-touched code is comprehension debt, and comprehension debt compounds.

If deploy frequency climbs while change failure rate and MTTR climb with it, you are shipping bugs faster. Nothing more.
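That reading can be written down as a small comparison between two snapshots, say the quarter before and the quarter after an AI rollout. This is only a sketch: the dictionary keys and the labels are made up for illustration, and a raw greater-than comparison ignores noise, so in practice you would want a few periods of data before trusting any verdict.

```python
def diagnose(before: dict[str, float], after: dict[str, float]) -> str:
    """Name the pattern between two DORA snapshots (hypothetical keys and labels)."""
    freq_up = after["deploys_per_week"] > before["deploys_per_week"]
    cfr_up = after["change_failure_rate"] > before["change_failure_rate"]
    mttr_up = after["mttr_hours"] > before["mttr_hours"]

    if freq_up and cfr_up and mttr_up:
        # The case described above: more deploys, more failures, slower recovery.
        return "shipping bugs faster"
    if freq_up and (cfr_up or mttr_up):
        return "faster, but one quality signal is slipping"
    if freq_up:
        return "faster, quality holding"
    return "no throughput gain"


# Made-up numbers for two quarters.
before = {"deploys_per_week": 5.0, "change_failure_rate": 0.10, "mttr_hours": 1.5}
after = {"deploys_per_week": 9.0, "change_failure_rate": 0.18, "mttr_hours": 3.0}
verdict = diagnose(before, after)
print(verdict)
```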

Top five things I would watch out for

[Image: A lone on-call engineer at 3am at a cluttered home desk lit by two monitors. One shows a red incident alert, the other an AI coding assistant chat window, the tense moment where MTTR on AI-touched code actually gets tested.]

There are no best practices yet. But here is what I would actually put on the wall if I were running an engineering org today.

1. DORA, all four, weekly. No partial credit. If you only track deploy frequency you will cheer the wrong thing. The four-as-a-set is the whole point.
2. Change failure rate as your leading indicator of AI slop. The moment this starts creeping up after an AI rollout, pause and look at what landed. Do not wait for a public incident to catch it.
3. MTTR on AI-touched code, specifically. Tag commits that were materially AI-assisted. Watch their recovery time separately. If it is worse than your baseline, your team does not understand their own codebase well enough to maintain it. That is the single most dangerous quiet failure mode of AI-augmented engineering.
4. Leverage, not volume, in 1:1s. Ask every engineer where AI genuinely unblocked them this week. Three specific answers will teach you more than any dashboard. If nobody can name a moment, the tools are not landing.
5. Team health signals as the tiebreaker. Retention, eNPS, burnout. A team hitting all four DORA metrics while losing its best people is telling you the metrics are lying about the thing you actually care about. Listen to that, not the graph.

Token counts are not on this list. That is deliberate. Aggregate token spend belongs in finance for budget planning. It does not belong in an engineer's performance review. Ever.
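For the MTTR-on-AI-touched-code item above, the comparison itself is simple once incidents carry a tag. Here is a rough sketch, assuming each incident has already been traced to a causing change and that change is marked AI-assisted or not. The `Incident` shape and the `ai_assisted` flag are hypothetical; how you tag commits (a commit-message convention, a PR label) is up to you.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Incident:
    # Hypothetical shape: one row per production incident traced to a change.
    recovery_minutes: float
    ai_assisted: bool  # True if the causing change carried your AI-assisted tag


def mttr_by_origin(incidents: list[Incident]) -> dict[str, Optional[float]]:
    """Mean recovery time, split into AI-assisted changes and everything else."""
    groups: dict[str, list[float]] = {"ai_assisted": [], "baseline": []}
    for inc in incidents:
        key = "ai_assisted" if inc.ai_assisted else "baseline"
        groups[key].append(inc.recovery_minutes)
    # None means "no incidents in this group", which is different from zero minutes.
    return {name: (mean(vals) if vals else None) for name, vals in groups.items()}


# Made-up incident log.
log = [
    Incident(recovery_minutes=30, ai_assisted=False),
    Incident(recovery_minutes=50, ai_assisted=False),
    Incident(recovery_minutes=90, ai_assisted=True),
    Incident(recovery_minutes=70, ai_assisted=True),
]
split = mttr_by_origin(log)
print(split)
```

With only a handful of incidents per group the gap will be noisy, so treat it as a prompt to go read the postmortems rather than as a verdict. Note that this is a team-level view of the codebase, not a per-engineer score.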

There is no right answer yet, and that is the point

[Image: A staff engineer alone in a quiet glass-walled office, standing in front of a dense distributed systems whiteboard sketch, holding a marker and thinking, the kind of quiet high-leverage work that never shows up on a token dashboard.]

Take comfort in this. The biggest engineering orgs on the planet are fumbling this question in public. There is no playbook. There is no “best practice” you missed. Anyone who walks into your office this quarter with a confident framework for measuring AI productivity is either selling something or has not shipped software in a while.

What honest engineering leaders are doing is the unglamorous thing: grounding in DORA, adding a couple of new signals where they make sense, watching the team carefully, and adjusting when reality disagrees with the dashboard. Measure. Adjust. Measure again. That is the work.

Takeaway

If you are a CTO, VP of Engineering, or team lead and somebody is about to roll out an AI productivity dashboard at your company, do three things this week:

1. Ground your org in DORA first. If you are not already tracking all four metrics, get there before you touch any AI-specific signal. Everything else is decoration.
2. Add change failure rate and MTTR on AI-touched code as your early warning lights. These two will tell you faster than anything else whether AI is a genuine multiplier or a beautifully disguised bug factory.
3. Never use token counts for individual evaluation. Budget and procurement, yes. Reviews and bonuses, no. Put that line in writing before anyone is tempted to cross it.

The point of measurement is to answer one question. Is AI actually making your team better at shipping software? If your dashboard cannot answer that in a sentence, your dashboard is measuring the wrong thing.