GitHub engineers Landon Cox and Mara Kiefer published a detailed breakdown in May 2026 of how their team cut token spend on internal agentic CI workflows by up to 62 percent, with the work picked up by InfoQ on May 29. The numbers come from production workflows that already run inside the gh-aw and gh-aw-firewall repositories, including auto-triage, smoke tests, a security guard, and community attribution jobs, not from synthetic benchmarks.
The most surprising part is not the headline percentage. It is that the savings are produced by two agents that audit and rewrite other agents every day, and that the GitHub team is honest enough to publish the workflow where pruning unused MCP tools made no difference at all. That candor, plus a normalized cost metric called Effective Tokens, turns the writeup into the first credible FinOps recipe we have seen for agent workloads in production.
The proxy that finally makes agent spend legible
Different agent frameworks emit logs in different shapes. Claude CLI, Copilot CLI, and Codex CLI all report token usage, but the fields and units do not line up. GitHub solved this by routing every agent call through the API proxy that already sits inside the agentic-workflows security architecture. Each workflow run now drops a token-usage.jsonl artefact with one record per API call, capturing input tokens, output tokens, cache-read tokens, cache-write tokens, model, provider, and timestamps.
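To make the artefact concrete, here is a minimal sketch of writing and reading one record per API call. The writeup lists which values are captured but not the schema, so the field names and the `UsageRecord` type below are assumptions for illustration only.

```python
import json
from dataclasses import dataclass, asdict
from io import StringIO


@dataclass
class UsageRecord:
    # Hypothetical field names: the source names the captured values,
    # not the exact keys used in token-usage.jsonl.
    timestamp: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int


def append_record(stream, record):
    """Write one API call as a single JSON line."""
    stream.write(json.dumps(asdict(record)) + "\n")


def read_records(stream):
    """Parse a JSONL stream back into UsageRecord objects, skipping blank lines."""
    return [UsageRecord(**json.loads(line)) for line in stream if line.strip()]
```

Because each line is an independent JSON object, a proxy can append records as calls complete and a downstream job can parse the file without loading any framework-specific log format.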
That is a small piece of plumbing with outsized consequences. Most engineering organizations today see agent costs as an aggregate line on a monthly Anthropic or OpenAI invoice, with no per-workflow attribution and no way to compare a Claude Sonnet run against a Codex run. The proxy-plus-artefact pattern is reproducible by any team that puts an HTTP proxy in front of its model providers, and it is the prerequisite for everything else in the writeup.
Effective Tokens, the unit that maps to dollars
On top of the raw data, GitHub built a metric called Effective Tokens (ET). The formula is ET = m × (1.0 × I + 0.1 × C + 4.0 × O), where I is newly processed input tokens, C is cache-read tokens, O is output tokens, and m is a model multiplier. Haiku counts as 0.25, Sonnet as 1.0, and Opus as 5.0. Output tokens carry a fourfold weight because they are the most expensive token class across providers, and cache reads are discounted to one tenth because they are served cheaply.
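The formula translates directly into a few lines of code. The sketch below uses the weights and multipliers quoted above; the tier keys (`"haiku"`, `"sonnet"`, `"opus"`) and the function name are illustrative, and cache-write tokens are left out because the formula as quoted does not include them.

```python
# Model multipliers as quoted in the writeup; the dictionary keys are illustrative.
MODEL_MULTIPLIER = {"haiku": 0.25, "sonnet": 1.0, "opus": 5.0}


def effective_tokens(model_tier, new_input, cache_reads, output):
    """ET = m * (1.0 * I + 0.1 * C + 4.0 * O)."""
    m = MODEL_MULTIPLIER[model_tier]
    return m * (1.0 * new_input + 0.1 * cache_reads + 4.0 * output)
```

For a call with 1,000 new input tokens, 10,000 cache-read tokens, and 500 output tokens, the weighted sum is 1,000 + 1,000 + 2,000 = 4,000. That is 4,000 ET on Sonnet, 1,000 on Haiku, and 20,000 on Opus.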
The point of ET is portability. A ten percent ET drop corresponds to roughly a ten percent dollar drop regardless of which model the workflow uses. For a CTO trying to compare Bedrock Claude spend against Azure OpenAI spend against a Copilot CLI experiment, that is the kind of normalization that makes governance possible. GitHub controls for workload changes by tracking LLM API call counts alongside ET, so a constant number of turns with falling tokens per call is read as a real efficiency gain rather than a smaller job.
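One way to apply that control is to compare ET per call between two periods and only credit an efficiency gain when the call count stayed roughly flat. This is a sketch of the reasoning, not GitHub's implementation; the function name, the verdict strings, and the 10 percent tolerance are assumptions.

```python
def compare_periods(before_et, before_calls, after_et, after_calls, call_tolerance=0.10):
    """Separate a real efficiency gain from a workload that simply shrank.

    call_tolerance is a hypothetical threshold for "roughly the same number of calls".
    """
    call_change = (after_calls - before_calls) / before_calls
    per_call_change = (after_et / after_calls) / (before_et / before_calls) - 1.0
    if abs(call_change) > call_tolerance:
        verdict = "workload changed"
    elif per_call_change < 0:
        verdict = "efficiency gain"
    else:
        verdict = "no gain"
    return {
        "call_change": call_change,
        "et_per_call_change": per_call_change,
        "verdict": verdict,
    }
```

With the same 100 calls before and after, a fall from 1,000,000 to 400,000 ET is reported as an efficiency gain. If the calls also fell from 100 to 40, ET per call is unchanged and the drop is attributed to a smaller workload.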
Two agents auditing the agents
Two scheduled jobs sit on top of the artefacts. The Daily Token Usage Auditor aggregates consumption by workflow, flags week-over-week increases, and surfaces the most expensive runs. The Daily Token Optimiser reads the source YAML and recent logs of flagged workflows and opens a GitHub issue with concrete proposals, such as "remove these eight MCP tools" or "replace this PR diff call with a gh pr diff invocation". Both auditors are themselves agentic workflows, and their own ET shows up in the daily reports, which keeps the optimizers honest about their cost.
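The aggregation and flagging step of the auditor can be approximated deterministically. The sketch below assumes runs arrive as `(workflow, week, et)` tuples and uses a hypothetical 20 percent threshold; the real auditor is an agentic workflow whose exact rules the writeup does not spell out.

```python
from collections import defaultdict


def flag_increases(runs, this_week, last_week, threshold=0.20):
    """Sum ET per workflow per week and flag week-over-week increases.

    runs: iterable of (workflow, week, et) tuples (an assumed input shape).
    Returns (workflow, previous_total, current_total) rows, most expensive first.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for workflow, week, et in runs:
        totals[workflow][week] += et

    flagged = []
    for workflow, by_week in totals.items():
        prev = by_week.get(last_week, 0.0)
        cur = by_week.get(this_week, 0.0)
        # Workflows with no prior-week data are skipped rather than flagged.
        if prev > 0 and (cur - prev) / prev > threshold:
            flagged.append((workflow, prev, cur))
    return sorted(flagged, key=lambda row: row[2], reverse=True)
```

The flagged list is the natural hand-off point to an optimizer step, which would then look only at workflows whose spend actually moved.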
Where the 62 percent actually came from
The biggest single lever was replacing in-agent Model Context Protocol calls with deterministic gh CLI invocations. An MCP tool call costs a full LLM round trip for both reasoning and data retrieval. A gh pr diff is a plain HTTP request. GitHub used two strategies: pre-agentic data downloads, where setup steps run gh commands and the agent reads the resulting files, and an in-agent CLI proxy that routes runtime traffic to the GitHub API without exposing auth tokens.
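The pre-agentic download strategy can be sketched as a setup step that shells out to the gh CLI and leaves a file for the agent to read. The function name, output file naming, and the injectable `run` parameter are assumptions added for illustration and testability; `gh pr diff` with `--repo` is the only CLI call used.

```python
import subprocess
from pathlib import Path


def prefetch_pr_diff(pr_number, repo, out_dir, run=subprocess.run):
    """Fetch a PR diff with the gh CLI before the agent starts.

    The agent later reads the file instead of spending an LLM round trip
    on a tool call. `run` is injectable so the step can be tested without gh.
    """
    cmd = ["gh", "pr", "diff", str(pr_number), "--repo", repo]
    result = run(cmd, capture_output=True, text=True, check=True)
    out_path = Path(out_dir) / f"pr-{pr_number}.diff"
    out_path.write_text(result.stdout)
    return out_path
```

Running the fetch in a setup step also keeps credentials out of the agent's context, since only the resulting file is exposed to it.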
Pruning unused tools from the MCP manifest was the second lever. A GitHub MCP server with 40 tools ships 10 to 15 KB of schema with every turn, and removing the unused entries saved 8 to 12 KB per call with no behavior change. The Auto-Triage Issues workflow fell 62 percent in ET, sustained over 109 post-fix runs. Smoke Claude dropped 59 percent, Security Guard 43 percent, and Daily Community Attribution 37 percent. Glossary Maintainer was found calling search_repositories 342 times in a single run, 58 percent of its tool calls, all unnecessary.
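To estimate what manifest pruning saves before changing a workflow, one can measure the serialized size of the tool list with and without the unused entries. This sketch assumes the manifest is available as a list of dictionaries with a `name` key, as in MCP tool definitions; the function name and the byte-count proxy for token cost are assumptions.

```python
import json


def prune_manifest(tools, used_names):
    """Keep only tools the workflow actually calls and report bytes saved per turn.

    tools: list of tool definitions, each a dict with at least a "name" key.
    used_names: set of tool names observed in recent runs.
    """
    kept = [tool for tool in tools if tool["name"] in used_names]
    before = len(json.dumps(tools).encode("utf-8"))
    after = len(json.dumps(kept).encode("utf-8"))
    return kept, before - after
```

Multiplying the bytes saved by the number of LLM calls per run gives a rough upper bound on the benefit. If that figure is small next to the rest of the context, pruning is unlikely to move ET.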
The honest counter-example is Daily Community Attribution. It had eight unused MCP tools with zero calls, and removing them produced no ET reduction because the manifest was a small fraction of total context. Separately, a Daily Syntax Error Quality job was caught in a 64-turn fallback loop because its sandbox bash allowlist only permitted relative-path globs, which blocked every gh aw compile invocation. A one-line allowlist fix eliminated the loop.
What mid-market engineering leaders should do this quarter
We think any team running more than two scheduled agent workflows in production should copy this pattern before the end of the quarter, and the order matters. First, put an HTTP proxy in front of every model provider you use and write a token-usage.jsonl per run, even if you only have one framework today. The proxy is a weekend of engineering and it is what makes the rest possible.
Second, adopt ET or a near clone as the unit your FinOps partner sees. Reporting raw input and output tokens to finance is a dead end once you start mixing Haiku, Sonnet, and GPT-class models in the same pipeline. Third, treat the MCP manifest as a per-turn tax. If your agent is shipping a 40-tool GitHub MCP schema and using six tools, you are paying for the other 34 on every call. The gh-aw repository already ships the auditor and optimiser as installable extensions, so a team running GitHub Actions can be measuring ET this week without writing custom infrastructure.
Here is where we would draw the line: if a vendor selling you an agent platform in 2026 cannot show per-workflow token attribution and a model-normalized cost metric in their console, they are not yet ready for production billing oversight, and a proxy-plus-artefact rig built in-house will give better answers than their roadmap promises.
The next milestone to watch
GitHub flagged portfolio-level analysis as the next step: deduplicating reads across workflows, consolidating overlapping jobs, and caching shared intermediate artefacts across a repository's entire agent fleet. If that ships inside gh-aw before the end of Q3 2026 with public numbers attached, we will treat agentic FinOps as a solved tooling problem for GitHub-hosted shops, and the conversation shifts to whether GitLab and Bitbucket follow. If it slips past Q3 with no portfolio numbers, that is a signal that the savings curve flattens fast once the obvious MCP and CLI wins are taken, and finance teams should budget for agent spend that grows roughly linearly with workflow count rather than betting on further optimizer compounding.



