AI Cost Is an Operating Discipline

Jun 28, 2026

Disclaimers upfront: I’ve relied heavily on Codex when writing this, and I’ll continue doing so throughout the writing. The outputs reflect the themes that are top of mind for me at any given time, and I’m heavily invested in getting a grip on what makes AI-generated output good quality.

This is a personal publication. Views expressed below are my own.

In March, Uber’s CTO said the company had already blown through its AI budget for 2026. The year was barely underway. That is what unmanaged AI adoption looks like once usage spreads beyond a few experiments. Cost starts moving with context design, model choice, routing rules, approval loops, and how casually people let agents run in the background. It changes while the work is happening, long before the invoice gives anyone a clean number to debate cost against value.

The usual finance-cleanup frame lands too late. By the time a leadership team studies the invoice, the real decisions have already been made inside the product and engineering system, and rework will not be easy. A default model was set too high. An agent loop ran longer than anyone expected. A frontier model kept handling low-stakes tasks that a cheaper model, a deterministic workflow, or a human step could have handled. The bill is the residue.

Recent reporting and operator commentary from Pragmatic Engineer, Mostly Growth, Fintech Brainfood, Charity Majors, AI Daily Brief, and Tomasz Tunguz all point to the same operating shift. Teams do not get AI costs under control by auditing usage after rollout. They get control by designing observability, routing, evals, and cost-per-outcome judgment into the system before spend spreads.

Treating AI cost like seat software with a bad surprise attached hides the change in usage. Seat logic fits a median company spending about $11.38 per employee per month on AI, as reported in TechCrunch’s summary of Ramp data. It stops fitting the heavy users. In the same dataset, the top 10% spend about $611 per employee each month, and the top 1% spend $7,500 per employee each month. Those firms are running an expensive production input with weak line of sight to output unless they build one deliberately.
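To make the gap concrete, the three figures above can be turned into multiples of the median. This is only back-of-the-envelope arithmetic in Python; the function names are mine, and the annualized number assumes the monthly rate holds flat for twelve months.

```python
# Monthly AI spend per employee, as quoted above from TechCrunch's summary of Ramp data.
MEDIAN = 11.38
TOP_10_PERCENT = 611.00
TOP_1_PERCENT = 7_500.00


def multiple_of_median(spend: float, median: float = MEDIAN) -> float:
    """How many times the median per-employee spend a given figure represents."""
    return spend / median


def annualized(spend_per_month: float) -> float:
    """Per-employee spend over twelve months, assuming a constant monthly rate."""
    return spend_per_month * 12


if __name__ == "__main__":
    for label, spend in [("top 10%", TOP_10_PERCENT), ("top 1%", TOP_1_PERCENT)]:
        print(f"{label}: {multiple_of_median(spend):.0f}x the median, "
              f"${annualized(spend):,.0f} per employee per year")
```

The top 10% come out at roughly 54 times the median and the top 1% at roughly 659 times, or about $90,000 per employee per year. Seat-style budgeting was never designed for a spread that wide.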

A common failure pattern is easy to recognize. A team adopts coding agents or support agents quickly. Usage rises faster than review habits, routing logic, and internal norms. FinOps and deliberate engineering practices lag because the first phase feels like experimentation. Foundations start to matter when usage becomes normal work. Leadership then asks the question the system should have answered already: what useful work did this spend produce?

Pragmatic Engineer captures that question directly through Uber’s internal tradeoff.

Praveen Neppalli in The Pulse at Pragmatic Engineer:

We’re going to have to start talking about token consumption and the associated cost versus headcount, and making tradeoffs on that as an engineering organization.

AI cost is just an engineering-management resource. Capital allocation all the way down. Once token use becomes large enough to compete with other resource choices, teams need the same discipline they built around cloud infrastructure over time: visibility by team and use case, budget envelopes, escalation paths, and architecture choices that keep the expensive path narrow. Mostly Growth makes the cloud analogy through operator stories: one company describes people “burning through tokens to do stuff,” another reports Claude costs jumping 45% month over month, and another says it is on pace to spend $4 million on tokens this year. Those usage patterns do not get solved through end-of-quarter spreadsheet-induced panic discussions.

But again, raw usage tells the team almost nothing by itself. Simon Taylor’s equation in The Token Economy strips away the seat-buying illusion. Tokens are only fuel. The organization has to turn that fuel into shipped work.

Simon Taylor in The Token Economy:

Outcome = Tokens × Intelligent Operating Model

The formula moves the discussion from usage to conversion. What stands between token spend and a useful product outcome? Intelligent Operating Model is the multiplier, and zero is also a number: heavy token spend run through a missing operating model still produces nothing. Sometimes the answer is obvious: a coding assistant helps a developer ship faster, a support workflow closes tickets sooner, a fraud review system cuts analyst time. Once usage gets broad, the missing layer becomes painful. Teams can see the token graph. They cannot see which tokens became a shipped feature, an onboarded customer, a smaller support queue, or incremental revenue.

Taylor argues that observability is the first management problem. His proposed metrics are blunt but directionally right: tokens burned per shipped feature, token cost per customer onboarded, token cost per dollar of incremental revenue. None of those metrics are universal. All of them are hard to measure. They are better than counting seats, prompts, or agent invocations as if activity itself were proof of value. Like lines of code. Or number of PRs.
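As a rough illustration of how those three ratios could be computed, here is a minimal Python sketch. The `WorkflowPeriod` schema and every field name in it are my assumptions, not something Taylor specifies; the hard part in practice is attributing outcomes to a workflow at all, which this sketch takes as given.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkflowPeriod:
    """One workflow's spend and outcomes over a reporting period (hypothetical schema)."""
    name: str
    token_cost_usd: float
    features_shipped: int = 0
    customers_onboarded: int = 0
    incremental_revenue_usd: float = 0.0


def cost_per(total_cost: float, outcome_count: float) -> Optional[float]:
    """Cost per unit of outcome; None when there is no outcome to divide by."""
    if outcome_count <= 0:
        # Spend with nothing to show for it: surface it as missing, not as zero.
        return None
    return total_cost / outcome_count


def scorecard(w: WorkflowPeriod) -> dict:
    """The three cost-per-outcome ratios for one workflow."""
    return {
        "workflow": w.name,
        "cost_per_feature": cost_per(w.token_cost_usd, w.features_shipped),
        "cost_per_customer": cost_per(w.token_cost_usd, w.customers_onboarded),
        "cost_per_revenue_dollar": cost_per(w.token_cost_usd, w.incremental_revenue_usd),
    }
```

For example, `scorecard(WorkflowPeriod("onboarding-agent", 1200.0, customers_onboarded=300))` gives a cost per customer of 4.0 and `None` for the other two ratios. The `None` is deliberate: it flags spend with no recorded outcome, so it cannot pass for a cheap result.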

Many teams misread the moment. They assume the mature response is austerity: lower the budget, negotiate vendor discounts, block a few tools, and move on. That response can look especially tempting after guidance swings from maximum experimentation to sudden restraint. Some cost cutting will happen, but it is incomplete. If the system cannot distinguish high-value expensive work from low-value expensive work, cuts will land in the dark. The team will either preserve wasteful paths because they look sophisticated or kill useful ones because they look expensive.

The engineering mechanism sits one layer deeper. Charity Majors explains why AI cost control is one facet of software design. When code generation becomes cheap, the durable work moves into knowing what behavior is required, what invariants must hold, how to validate output, and how to observe what the system is doing in production.

Charity Majors in AI demands more engineering discipline. Not less:

Those are not code problems. They are evaluation problems.

Token spend becomes hard to manage for the same reason agent-written code becomes hard to trust: the expensive failures do not start in the invoice. They start in missing constraints, missing traces, weak evals, unclear service levels, and architecture that lets the costly path become the default path. If regeneration is easy, teams can produce more prompts, more code, and more agent actions than they can responsibly validate. The bill rises first. The evidence arrives later, if it arrives at all.

Majors puts production inside development, not after it. AI cost discipline needs the same assumption. The team cannot wait for a monthly spend review to discover that a prompt template change tripled cost, that a cheap model quietly degraded an important workflow, or that an expensive reasoning model is now handling routine requests. Those conditions belong in the operating loop: trace, inspect, compare, adjust, release, monitor again.
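One way to catch the "prompt template change tripled cost" case inside the loop is to compare cost per request across template versions straight from traces. A minimal sketch follows, assuming each trace is a dict with hypothetical `template_version` and `cost_usd` keys. The 1.5x alert ratio is an arbitrary placeholder, not a recommended threshold.

```python
from collections import defaultdict
from statistics import mean


def mean_cost_by_version(traces):
    """Average cost per request for each prompt template version.

    `traces` is an iterable of dicts with the (assumed) keys
    'template_version' and 'cost_usd'.
    """
    costs = defaultdict(list)
    for t in traces:
        costs[t["template_version"]].append(t["cost_usd"])
    return {version: mean(values) for version, values in costs.items()}


def cost_regressions(traces, baseline, max_ratio=1.5):
    """Versions whose mean cost per request exceeds max_ratio times the baseline's.

    Returns {version: ratio_to_baseline}. Raises KeyError if the baseline
    version has no traces, which is itself worth knowing.
    """
    means = mean_cost_by_version(traces)
    base = means[baseline]
    return {
        version: avg / base
        for version, avg in means.items()
        if version != baseline and avg / base > max_ratio
    }
```

Run on each release, a check like this turns a tripled cost into a failed comparison on day one, well before a monthly review. The same shape works for comparing eval scores across versions, which would cover the quietly degraded cheap model.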

Underpriced usage masked bad operating habits. AI Daily Brief argues that subscription logic built for chatbot usage is straining under coding agents and longer-running workloads that consume materially more tokens. Tunguz shows a related pattern from the vendor side: flagship prices fell, then started to diverge or rise again as strategy shifted from share capture toward margin discipline. Rising token consumption turns that pricing shift into an operating pressure for buyers.

Those economics change the job of architecture. Model choice becomes part of the product surface. AI Daily Brief puts it in practical terms with “intelligence per unit of cost,” cheap-model bake-offs, escape-hatch architecture, and a cost scoreboard. The team has to know where high reasoning quality is worth paying for, where a cheaper model is good enough, where deterministic automation beats both, and where a human fallback should remain in the loop.

Mostly Growth shows the operating response: teams negotiate across providers, set cheaper models as defaults, and replace AI calls with Zapier or n8n flows when the task is deterministic. Cost-aware system design makes room for both AI and ordinary automation.

Nate B. Jones’s harness frame gives that operating response a system boundary. A token is raw intelligence. The harness turns that raw intelligence into work through files, tools, permissions, memory, evals, routing, and workflow definition. Cost discipline belongs in the layer that decides which model gets called, with what context, under what limits, for what standard of done, and with what evidence attached.
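The routing part of that harness can be sketched as a small decision function: deterministic automation first, a cheap model by default, frontier reasoning only for high-stakes work, and a human escape hatch when the budget envelope is spent. Everything here is hypothetical, including the `Task` fields, the path names, and the per-call cost estimates. A real harness would also weigh context size and eval results.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    DETERMINISTIC = "deterministic"   # ordinary automation, no model call
    CHEAP_MODEL = "cheap_model"
    FRONTIER_MODEL = "frontier_model"
    HUMAN = "human"


@dataclass
class Task:
    kind: str
    has_deterministic_rule: bool   # a fixed workflow already covers this task
    high_stakes: bool              # errors are costly enough to justify frontier reasoning
    remaining_budget_usd: float    # what is left in this workflow's envelope


# Hypothetical per-call cost estimates; real numbers depend on provider and context size.
EST_COST_USD = {Path.CHEAP_MODEL: 0.002, Path.FRONTIER_MODEL: 0.08}


def route(task: Task) -> Path:
    """Pick the cheapest path that fits the task, keeping the expensive path narrow."""
    if task.has_deterministic_rule:
        return Path.DETERMINISTIC
    wanted = Path.FRONTIER_MODEL if task.high_stakes else Path.CHEAP_MODEL
    if task.remaining_budget_usd >= EST_COST_USD[wanted]:
        return wanted
    # Envelope exhausted: hand off to a person instead of silently overspending or downgrading.
    return Path.HUMAN
```

The design choice worth noting is the last line. When the budget runs out, the sketch escalates to a person, so a high-stakes task is never quietly downgraded to a cheaper model. That keeps the cost limit from turning into a hidden quality regression.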

The team needs a dashboard, but the dashboard cannot stop at total spend. It needs spend by team, by use case, by workflow, and by service level. It needs to show which workflows are allowed to use frontier reasoning, which ones default to cheaper models, and which ones should leave the model path entirely when determinism is available. It needs monthly limits and overage justification for broad internal tools, as Pragmatic Engineer reports DoorDash using. It needs eval thresholds so a model-routing change can lower cost without quietly lowering product quality. It needs a fallback path when the expensive model is unavailable, and coding-heavy workflows need enough tests, traces, and runbooks to stay diagnosable if the frontier agent is offline.
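The spend breakdown and the monthly limit check are the easiest parts to prototype. Below is a minimal sketch, assuming usage records are dicts with hypothetical `team`, `workflow`, and `cost_usd` keys. It is not a description of how DoorDash or anyone else implements limits.

```python
from collections import defaultdict


def spend_by(records, *keys):
    """Sum cost_usd grouped by the given keys, e.g. spend_by(records, 'team', 'workflow')."""
    totals = defaultdict(float)
    for r in records:
        totals[tuple(r[k] for k in keys)] += r["cost_usd"]
    return dict(totals)


def over_limit(records, monthly_limits_usd):
    """Teams whose month-to-date spend exceeds their limit, mapped to the overage amount.

    Teams without a configured limit are skipped, not treated as unlimited or as zero.
    """
    by_team = spend_by(records, "team")
    return {
        team: total - monthly_limits_usd[team]
        for (team,), total in by_team.items()
        if team in monthly_limits_usd and total > monthly_limits_usd[team]
    }
```

The overage amount is the number a justification conversation would start from. The same `spend_by` call with `'workflow'` or a service-level key gives the other cuts the dashboard needs.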

An AI feature spec cannot stop at user intent and response quality. It has to say what resource envelope the feature is allowed to consume, what latency the user will tolerate, what failure mode is acceptable, and what outcome metric will justify keeping the feature alive. The spec also has to travel with the release: routing rules, eval thresholds, and fallback conditions need to change when production behavior changes. Otherwise the team launches an unbounded cost center and calls it experimentation.
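A spec like that can live next to the code as data and be checked against production numbers on every release. Here is a minimal sketch with hypothetical field names; it assumes the eval suite produces a single score between 0 and 1, which real suites often do not.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    max_cost_per_request_usd: float   # resource envelope
    max_p95_latency_ms: int           # latency the user will tolerate
    min_eval_score: float             # quality floor from the eval suite, 0..1
    outcome_metric: str               # what justifies keeping the feature alive
    fallback: str                     # acceptable failure mode


@dataclass(frozen=True)
class Observed:
    cost_per_request_usd: float
    p95_latency_ms: int
    eval_score: float


def violations(spec: FeatureSpec, obs: Observed) -> List[str]:
    """Which parts of the spec the observed production behavior breaks."""
    broken = []
    if obs.cost_per_request_usd > spec.max_cost_per_request_usd:
        broken.append("cost")
    if obs.p95_latency_ms > spec.max_p95_latency_ms:
        broken.append("latency")
    if obs.eval_score < spec.min_eval_score:
        broken.append("eval")
    return broken
```

A routing change that cuts cost but pushes the eval score under the floor comes back as `["eval"]`. That is the case the spec exists to catch: cheaper, and quietly worse.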

Cost discipline changes repeated work. PMs, engineers, platform owners, and finance partners need a shared review loop. A customer signal becomes a proposed workflow. The workflow gets a target service level, a model-routing rule, an eval, a budget envelope, and a fallback. Production traces show how it behaves. If spend rises without a matching improvement in user value, the team changes the route, narrows the use case, downgrades the model path, or deletes the workflow. Roadmap and sprint rituals built for scarce delivery capacity do not decide which workflows deserve expensive reasoning, which should be simplified, and which should be removed. The decision surface is continuous.

The loop also changes where trust attaches. Trust should not attach to token volume, the prestige of the model, or the number of teams using the tool. It should attach to the evidence path from spend to durable behavior change. Can the team show that a given workflow improved onboarding throughput, reduced support time, or shipped a useful feature sooner without breaking quality, latency, or reliability constraints? If it cannot, a high bill is only one symptom. The system cannot explain itself.

Companywide usage turns the same problem into portfolio work. In The AI Portfolio Dilemma, one operator describes enterprises carrying lists of 100 or 200 AI use cases without strong cost accounting or lifecycle discipline. That is how shadow spend returns in a new form. A use case enters the portfolio as a cheap subscription idea and becomes a much larger token bill once real usage lands. A workflow survives because nobody owns decommissioning it. An architecture choice locks in a costly provider path that later proves unnecessary. Those are portfolio problems, but they start in the operating model.

Not every team needs the full machinery today. A low-usage team experimenting with a small number of internal workflows can stay lighter for a while. The controls need to appear before uncontrolled habits become the product. Once teams normalize vague ROI stories, frontier-by-default routing, and poor visibility into production behavior, the later cleanup will be slow and political.

AI cost has become a mature engineering subject for the same reason cloud spend did. The resource is elastic, the local decisions are easy, and the aggregate consequences arrive late. Teams that treat the bill as a downstream finance issue will spend their time explaining surprises. Teams that treat cost as part of architecture, harness design, eval coverage, and product judgment can decide where expensive intelligence belongs and where it does not. When usage rises and subsidies weaken, every token has to justify itself.

References

Charity Majors, “AI demands more engineering discipline. Not less,” June 15, 2026 — charity.wtf


Gergely Orosz, “The Pulse: a trend of trying to cut back on AI spend within eng departments?,” June 11, 2026 — https://newsletter.pragmaticengineer.com/p/the-pulse-a-trend-of-trying-to-cut-328
Simon Taylor, “The Token Economy: Tokenmaxxing Is Stupid Until It Isn’t,” May 10, 2026 — https://www.fintechbrainfood.com/p/the-token-economy
Kyle Wiggers, “’AI-pilled’ firms spend $7,500 per employee each month on AI,” June 10, 2026 — https://techcrunch.com/2026/06/10/ai-pilled-firms-spend-7500-per-employee-each-month-on-ai/
Mostly Growth, “AI Spend Is the New Cloud Bill,” June 24, 2026 —

Nate B. Jones, “The Harness Is the Business: Inside the OpenAI and Anthropic IPO Bet,” June 15, 2026 — https://shows.acast.com/ai-news-strategy-daily-with-nate-b-jones/episodes/the-harness-is-the-business-inside-the-openai-and-anthropic
AI Daily Brief, “The AI Subsidy Era is Over,” April 28, 2026 — https://aidailybrief.beehiiv.com/p/the-ai-subsidy-era-is-over
Tomasz Tunguz, “The Unsustainable Subsidy,” May 20, 2026 — https://tomtunguz.com/ai-model-inflation/
CDO Matters, “The AI Portfolio Dilemma: Why Most Companies Can’t Account for What They’re Spending,” June 26, 2026 —

AI Operating Systems by Mart Roosimägi
