The New Job Is Designing the System

Jun 07, 2026

I’ve restarted writing here on Substack. Different professional path now, different articles. I relied heavily on Codex when writing this, and I’ll keep doing so in future posts. The output reflects the themes that are top of mind for me at any given time, and I’m heavily invested in getting a grip on what makes AI-generated output good.

This is a personal publication. Views expressed below are my own.

Miles Palmer learned to code because he wanted to stop designing one final object.

As a graphic design student, he became interested in generative design: rules, software, and interactions that could keep producing new designs. Things clicked for me when I listened to his Tech Talks interview, in which he described the shift from personal experience.

Miles Palmer, Tech Talks:

“Rather than designing the end artifact, you are designing the system, and the system creates the design.”

That early idea now reaches far beyond product requirements. A product manager can build a prototype, an engineer can supervise several coding agents, and an operator can turn a recurring workflow into an agent that runs overnight. The leverage has moved from defining features and creating code to designing the system that produces the visible artifact: context, tools, permissions, models, evaluation, memory, and the route back to a human decision.

Palmer also describes the cost. When his concentration drops, he hands work to AI and keeps producing. The natural stopping point disappears. He gets through more work while wondering whether he is preserving the judgment he once practiced directly.

The tension has not gone away; it has moved into the new job. The system can create more output than a person could make alone. The job is now to decide what the system should do, what resources the work deserves, how quality will be judged, and when production should stop.

The Azure Team Made Its Work Meta

Satya Nadella gives the idea an operating shape in his No Priors conversation during the recent Microsoft Build. According to Nadella, Microsoft built more Azure capacity in 15 months than it had built during Azure’s first 15 years. The networking team facing that growth did not expect the existing workflow to scale through more effort. It reconceived the job.

Satya Nadella, No Priors:

“Our job is not to do Azure networking. Our job is to build the agentic system that does Azure networking.”

The outcomes reach physical infrastructure. Fibre gets cut. Operators send messages. Repairs have to be coordinated across more than 500 providers. The team did not escape the operation by placing a chatbot beside it. It built a system that could receive signals, use tools, carry context, act through a workflow, and return exceptions to people.

Nadella calls this “meta work.” The networking team works on the mechanism that performs more of the networking operation. Work happens on the mechanisms, not the outputs.

The same pattern appears in his description of enterprise AI harnesses. A working system joins models, data, tools, and context. It exposes tools progressively instead of loading every possibility into every step. It records traces. It uses private evaluations to measure performance on work the company itself values. It gives a person an interface for inspecting what long-running agents did with delegated authority.

This is not prompt writing. The job includes choosing the model, preparing context, defining action rights, creating the eval, observing the run, and changing the system when it fails.

Previously, I wrote about a reviewed change loop: signal, proposed change, judgment, tests, release, monitoring, and the next signal. System design sits around that loop. It determines which signals enter, what the agent can touch, how much computation the task receives, what evidence appears at review, and whether the next run benefits from what the last one learned.

The artifact becomes one event inside a system.

Agency Is Part of the Architecture

A technically capable system can remain organizationally inert. It retrieves the answer, prepares the change, and waits inside the same approval structure that slowed the human workflow.

Elena Verna makes that problem concrete in Your Company Needs Agency, Not Agents. Companies often restrict information by title, route decisions to a small group, and treat employees as risks to contain. Adding agents gives people more production capacity without giving them the context or authority to use it.

At Lovable, Verna says departmental agents have a “parent”: a person with deep context who keeps the agent accurate and current. That solves only part of the problem. Employees also need access to the context and permission to act on it.

Agency has four practical parts:

enough context to understand the situation;
permission to take a bounded action;
accountability for the result;
a way to reverse course when the decision is wrong.

Remove any one of them and the system falters. Context without action rights creates a better-informed queue. Autonomy without accountability creates unmanaged risk. Accountability without reversibility makes every decision expensive, which sends authority back up the hierarchy.

Bain’s 2026 survey of 951 companies shows how far production systems remain from the fully autonomous picture used in many investment cases. Seven percent of respondents reported fully autonomous agents in production. Human approval was the dominant model at 38%, while another 32% used guardrails and exception handling.

Human involvement is not evidence that the system failed. Consequential work often should stop for a person. The design error is pretending that review, exception handling, and accountability are free. If the business case assumes full automation while production routes a large share of decisions into a human queue, the budget describes one system and the organization operates another.
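The gap between the budgeted system and the operated one can be made concrete with a small calculation. This is a minimal sketch, not Bain's model: the function name, the per-decision costs, and the 30% human share are all hypothetical numbers chosen for illustration.

```python
def cost_per_decision(agent_cost: float, human_share: float, review_cost: float) -> float:
    """Expected cost of one decision when a share of decisions is routed to a person."""
    if not 0.0 <= human_share <= 1.0:
        raise ValueError("human_share must be between 0 and 1")
    return agent_cost + human_share * review_cost

# All figures are hypothetical, in dollars per decision.
# The business case assumes no decision ever reaches a human queue.
budgeted = cost_per_decision(agent_cost=0.10, human_share=0.0, review_cost=4.00)
# Production routes 30% of decisions to a reviewer.
operated = cost_per_decision(agent_cost=0.10, human_share=0.30, review_cost=4.00)

print(f"budgeted: ${budgeted:.2f}, operated: ${operated:.2f}")
```

With these assumed inputs the operated cost is about thirteen times the budgeted one, and almost all of the difference is review time that the business case never priced.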

Bain also asks leaders to name the person accountable for a consequential wrong decision before deployment. Verna adds the other side of that bargain: employees receiving more autonomy have to accept that a failed launch or damaged metric can genuinely be their responsibility.

The agent’s permissions and the employee’s decision rights belong in the same design.

Token Yield Is Designed

Resource allocation enters much earlier than the monthly cloud bill.

Arvind Jain proposes token yield in his post Your Token Spend Is an AI Architecture Problem, Not Just a Model Problem: the useful outcome produced per token consumed.

A short instruction can hide a large workload. “Analyse churn risk and create follow-up tasks” may bring system instructions, tool definitions, retrieved documents, memory, execution traces, intermediate results, and repeated reasoning into the active context. The user typed one sentence. The system may process tens of thousands of tokens before acting.

The model price explains only part of that consumption. Architecture decides how much material the model receives, which tools it sees, how the task is divided, whether prior work is reused, and which model handles each step.
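One way a team might put a number on token yield is sketched below. It assumes each agent run is logged with its total token count and whether the result was accepted. The `Run` record and its fields are hypothetical, and "accepted result" is my stand-in for "useful outcome", not a definition from Jain's post.

```python
from dataclasses import dataclass

@dataclass
class Run:
    """One agent run. Field names are illustrative assumptions."""
    tokens: int      # everything processed: instructions, tools, retrieval, output
    accepted: bool   # did a human or an eval accept the result?

def token_yield(runs: list[Run]) -> float:
    """Accepted results per million tokens consumed across all runs."""
    total_tokens = sum(r.tokens for r in runs)
    if total_tokens == 0:
        return 0.0
    accepted = sum(1 for r in runs if r.accepted)
    return accepted / total_tokens * 1_000_000

# Hypothetical log: two accepted runs and one rejected run.
runs = [Run(40_000, True), Run(85_000, False), Run(25_000, True)]
print(round(token_yield(runs), 2))
```

Counting the rejected run's tokens in the denominator is deliberate: tokens spent on work nobody accepts still lower the yield.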

Jain identifies four levers.

First, retrieve better context. A model cannot know that half of the supplied material is stale, redundant, or about a similarly named metric from another business. It processes the context it receives. Noisy retrieval spends tokens on assembling the problem and can still produce a worse answer.

Jain cites a Glean benchmark in which Glean’s centralized index was reportedly preferred about 2.5 times as often as off-the-shelf MCP tools, while those tools consumed around 30% more tokens. When the off-the-shelf tools won on correctness, Jain reports that they needed roughly 83,000 tokens compared with 43,000 for Glean. This is vendor evidence, not an independent general result. It still illustrates the mechanism: weak retrieval can force more tool calls, more over-fetching, and more reasoning loops.

Second, route models by task. Search, retrieval planning, validation, and execution management do not always require the most expensive frontier model. A multimodel system can preserve frontier reasoning for ambiguous or differentiated work and send narrower steps to smaller models.
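A routing rule of this kind can be very small. The sketch below is an assumption about how it might look, not Jain's or Glean's implementation: the tier names `small` and `frontier` and the step kinds are placeholders for whatever models and step types a real system uses.

```python
# Hypothetical mapping from step kind to model tier.
ROUTES = {
    "search": "small",
    "retrieval_planning": "small",
    "validation": "small",
    "execution_management": "small",
}

def route(step_kind: str, ambiguous: bool = False) -> str:
    """Pick a model tier for one step of a workflow."""
    if ambiguous:
        # Narrow steps can still turn out to be hard; escalate those.
        return "frontier"
    # Unknown step kinds fall back to the frontier tier rather than guessing.
    return ROUTES.get(step_kind, "frontier")

print(route("search"))                    # small
print(route("search", ambiguous=True))    # frontier
print(route("strategy_memo"))             # frontier
```

The conservative default matters: a router that sends unrecognized work to the cheapest model saves tokens by risking quality where nobody is looking.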

Third, learn from prior execution. A recurring workflow should not rediscover the same tool sequence or repeat the same failed path every morning. Traces can show which retrieval path worked, which calls were unnecessary, and where the human rejected the result. The next related task should become cheaper or more reliable.
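As a rough illustration of learning from traces, the sketch below remembers the shortest accepted tool sequence for each kind of task and offers it as the starting plan next time. The class, its method names, and the tool names are hypothetical; a production system would store far richer traces than a list of tool names.

```python
from typing import Optional

class TraceMemory:
    """Remembers the shortest accepted tool sequence per task kind."""

    def __init__(self) -> None:
        self._best: dict[str, list[str]] = {}

    def record(self, task_kind: str, tool_calls: list[str], accepted: bool) -> None:
        """Store a trace only if a human accepted the result and it beats the known path."""
        if not accepted:
            return
        known = self._best.get(task_kind)
        if known is None or len(tool_calls) < len(known):
            self._best[task_kind] = list(tool_calls)

    def plan(self, task_kind: str) -> Optional[list[str]]:
        """Return a previously successful path, or None if the task kind is new."""
        return self._best.get(task_kind)

memory = TraceMemory()
memory.record("churn_review", ["search_crm", "search_wiki", "search_crm", "create_task"], accepted=True)
memory.record("churn_review", ["search_crm", "create_task"], accepted=True)
memory.record("churn_review", ["create_task"], accepted=False)  # rejected, so ignored
print(memory.plan("churn_review"))
```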

Fourth, manage the harness. A naive long-running agent keeps adding tools, instructions, state, and intermediate output to an expanding context window. A stronger harness gives each step the working set it needs, externalizes durable state, and scopes tools to the current action.
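The difference between the naive and the stronger harness can be sketched in a few lines. This is a simplified assumption of the pattern, not any vendor's harness: the `Step` and `Harness` classes, the tool names, and the state keys are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    tools: list[str]   # the only tools this step is allowed to see
    needs: list[str]   # keys of durable state this step reads

@dataclass
class Harness:
    all_tools: dict[str, str]                             # tool name -> description
    state: dict[str, str] = field(default_factory=dict)   # durable state kept outside the prompt

    def context_for(self, step: Step) -> dict:
        """Build the working set for one step instead of passing everything along."""
        return {
            # Raises KeyError if a step asks for a tool the harness does not have.
            "tools": {t: self.all_tools[t] for t in step.tools},
            "state": {k: self.state[k] for k in step.needs if k in self.state},
        }

harness = Harness(
    all_tools={
        "search_crm": "Look up accounts",
        "create_task": "Create a follow-up task",
        "send_email": "Send an email",
    },
    state={"churn_accounts": "acct-17, acct-42", "draft_email": "Hello"},
)
step = Step("create follow-ups", tools=["create_task"], needs=["churn_accounts"])
ctx = harness.context_for(step)
print(ctx)
```

The step that creates follow-up tasks never sees the email tool or the draft email, so neither can inflate its context or be used by mistake.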

These choices explain why token efficiency is a system property. A cheaper model inside a wasteful harness can cost more than an expensive model reached through clean context and a short path.
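A quick calculation shows how that inversion can happen. The prices and token counts below are hypothetical and chosen only to illustrate the arithmetic; they are not taken from any vendor's price list or from the benchmarks cited above.

```python
def task_cost(tokens: int, usd_per_million: float) -> float:
    """Cost of one task given the tokens processed and a flat per-token price."""
    return tokens / 1_000_000 * usd_per_million

# Hypothetical: a cheap model at $1 per million tokens in a harness that
# processes 300,000 tokens per task.
cheap_wasteful = task_cost(300_000, 1.0)
# Hypothetical: a model five times as expensive, reached through clean
# context that needs only 40,000 tokens per task.
pricey_clean = task_cost(40_000, 5.0)

print(f"cheap model, wasteful harness: ${cheap_wasteful:.2f}")
print(f"expensive model, clean harness: ${pricey_clean:.2f}")
```

Under these assumptions the expensive model costs about a third less per task, because the harness reduced token volume by more than the price ratio between the two models.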

Nadella reaches the same conclusion from the harness side: progressive tool disclosure and prepared context help keep a multimodel system efficient. Perplexity CEO Aravind Srinivas describes orchestration as balancing accuracy, latency, cost, privacy, intelligence, and energy while choosing models and execution locations. Ars Technica’s account of GitHub Copilot’s usage pricing shows the same principle at user level: a long-running chat resends old context, while model selection can change the cost of an otherwise similar request.

The system designer allocates intelligence. Spending more is sometimes correct. Spending blindly is not.

Efficiency Includes Human Attention

Token yield remains incomplete if the system saves compute by sending more work to people.

Nadella describes running a hundred coding-agent sessions and getting the cognitive load back as the human in the loop. Chat cannot remain the only artifact. The operator needs a canvas or another interface that shows what happened, what changed, and where attention is required.

That interface is part of the system, just as retrieval and routing are. A person should not have to reconstruct every agent path or read every line to find the two decisions that carry risk.

Bain’s survey makes the review load economically relevant. Most production agents in its sample still rely on approval, guardrails, or exceptions. A system may look inexpensive in token accounting while creating an unmeasured queue of checks, corrections, and escalations.

Palmer’s experience brings the cost inside one person. AI allows him to continue when he would previously have stopped, walked away, or let an idea sit. More activity fills the recovered capacity. Reflection does not automatically survive.

Miles Palmer, Tech Talks:

“I’m now able to get through a much bigger volume, but along the way I feel like I’m eroding my brainpower to a degree.”

This is not evidence that AI causes general cognitive decline. It is evidence from an experienced practitioner that production capacity and healthy work can diverge.

A well-designed system preserves stopping points. It shows uncertainty. It groups related changes. It escalates the small set of decisions that need expertise. It lets a person inspect the trace without reliving the entire run. It also leaves space for the person to form an opinion before the machine supplies one.

A system that saves tokens while consuming more judgment is inefficient. A system that protects attention by sending every task to the most expensive model may be inefficient too. The design problem includes both resources.

Output Metrics Misdescribe the Job

The easiest AI measures sit closest to production: tokens consumed, lines generated, pull requests merged, tasks completed, and agents launched.

They show that something happened. They do not show that the system improved.

Anthropic reports that more than 80% of the code merged into its codebase in May 2026 was authored by Claude. It also reports that the typical engineer merged eight times as much code per day as in 2024. Anthropic immediately qualifies the number: lines of code measure quantity over quality and almost certainly overstate the true productivity gain.

The more consequential evidence appears elsewhere in the report. Claude can execute well-specified experiments quickly. Humans still choose many of the problems and create the scoring rubrics. As generation expands, code review becomes a constraint. The job changes from typing the output toward selecting the problem, defining success, reviewing evidence, and building systems that verify the result.

Jellyfish’s analysis of 12,000 developers across 200 companies reaches a similar warning from a different direction. In the joined subset of roughly 7,500 developers, pull-request throughput rose with token use. The highest-usage group achieved about twice the throughput while using roughly ten times the tokens per pull request. Jellyfish is a vendor, and merged PRs remain an incomplete measure of software value. The data still shows why token consumption cannot stand in for impact.
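The two multiples are worth working through. The sketch below normalizes the baseline group to one pull request at one unit of tokens, and it assumes both multiples are measured against the same baseline group, which the summary above does not specify.

```python
# Normalized baseline: 1 pull request costing 1 unit of tokens.
base_prs = 1.0
base_tokens_per_pr = 1.0

high_prs = 2.0 * base_prs                        # "about twice the throughput"
high_tokens_per_pr = 10.0 * base_tokens_per_pr   # "roughly ten times the tokens per pull request"

base_total = base_prs * base_tokens_per_pr
high_total = high_prs * high_tokens_per_pr

# Tokens spent for each additional pull request beyond the baseline.
marginal_tokens_per_extra_pr = (high_total - base_total) / (high_prs - base_prs)

print(high_total / base_total)         # total token multiple
print(marginal_tokens_per_extra_pr)    # cost of the extra PR, in baseline-PR units
```

On these rough figures the heaviest users consume about twenty times the tokens in total, and each additional pull request costs about nineteen times what a baseline pull request did.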

Bad incentives make the distinction visible. Uber reportedly encouraged employees to use AI “as much as possible” and ranked internal usage before exhausting its annual AI budget in four months. It later introduced a $1,500 cap per employee, per month, for each agentic coding tool, including Claude Code and Cursor; employees can exceed it with permission. Uber’s COO said drawing a line between the usage and new consumer features was difficult.

At the other extreme, Sam Altman said OpenAI’s heaviest internal user consumed about 100 billion tokens in a month. The number demonstrates scale. Without the work produced, accepted, or improved, it says little about value.

Lines of code, PRs, and tokens are exhaust from the system. A team can produce more of all three while creating maintenance debt, review fatigue, or features customers ignore.

Measure Resource Conversion

The measurement chain needs to extend beyond activity:

Resources -> system behavior -> accepted result -> durable outcome

Resources include tokens, model time, energy, human review, and the context-maintenance work around the system. System behavior includes correct routing, retrieval quality, tool use, stopping, escalation, and recovery. An accepted result survives evaluation and human judgment. A durable outcome changes something worth preserving for a user or the operation.

That chain supports more revealing measures:

tokens and cost per accepted result;
human review time per accepted result;
tokens spent on work that is rejected or reverted;
quality loss when a cheaper model handles the task;
correct stop and escalation rates;
time from an operational signal to a verified resolution;
repeated failures converted into tests, routing rules, or reusable context;
lower resource use when the system encounters a related task again;
customer or operational outcomes after the result enters production.
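A few of these measures fall out of a simple log of results. The sketch below is one possible shape under assumptions of my own: the `Result` record, its field names, and the sample numbers are hypothetical, and a result counts as kept only if it was accepted and not later reverted.

```python
from dataclasses import dataclass

@dataclass
class Result:
    tokens: int
    cost_usd: float
    review_minutes: float
    accepted: bool
    reverted: bool = False

def conversion_report(results: list[Result]) -> dict[str, float]:
    """Normalize total resources by the number of results that were kept."""
    kept = [r for r in results if r.accepted and not r.reverted]
    if not kept:
        raise ValueError("no accepted results to normalize by")
    n = len(kept)
    total_tokens = sum(r.tokens for r in results)
    wasted_tokens = sum(r.tokens for r in results if not r.accepted or r.reverted)
    return {
        "tokens_per_accepted": total_tokens / n,
        "cost_per_accepted": sum(r.cost_usd for r in results) / n,
        "review_minutes_per_accepted": sum(r.review_minutes for r in results) / n,
        "wasted_token_share": wasted_tokens / total_tokens,
    }

# Hypothetical log: two kept results, one rejected, one accepted then reverted.
results = [
    Result(40_000, 0.20, 5, accepted=True),
    Result(60_000, 0.30, 12, accepted=False),
    Result(50_000, 0.25, 8, accepted=True, reverted=True),
    Result(50_000, 0.25, 3, accepted=True),
]
report = conversion_report(results)
print(report)
```

Dividing all resources, including those spent on rejected and reverted work, by the kept results is the point: the denominator is what survived, not what was produced.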

No team needs every measure. It needs a small set that describes the system it is trying to improve.

Nadella’s private-eval argument provides a control point. A company should be able to test its own work, switch models, and know whether performance held. Jain adds a compounding test: each completed task should improve the economics of the next related task. Bain pushes measurement up to better decisions, faster responses, and stronger customer outcomes.

These measures also change the daily job. The PM defines the outcome and the evidence that would make a result worth accepting. The engineer shapes the harness, architecture, tests, and observability. The operator identifies exceptions and turns repeated failures into system changes. Teams decide where additional tokens or human attention have a plausible return.

People still make artifacts. Their leverage increasingly comes from improving the conditions under which artifacts are made.

Preserve Judgment Inside the System

Palmer’s generative-design idea began with possibility. A system could produce an open-ended range of designs and let people interact with the result. AI gives that idea a new scale. The system can now research, write, code, call tools, remember, and act.

Its quality depends on more than the final artifact.

The system needs context that is current and bounded. It needs permissions that match the risk. It needs routing that spends expensive intelligence where it earns its cost. It needs evals tied to work the organization values. It needs traces a person can inspect, stopping points that protect attention, and feedback that makes repeated work cheaper and better.

It also needs people who can judge the result without the system doing all of the judging for them.

A worker produces an artifact. A system designer decides which artifacts should exist, what resources they deserve, how they will be evaluated, who can act on them, and how each execution improves the next.

That is the new job.

References

Is the Race to Do More with AI Eroding Our Brainpower?, Miles Palmer / Tech Talks, June 3, 2026.
The Rise of the Full-Stack Builder and Hyper-Leveraged Generalist, Satya Nadella / No Priors, June 4, 2026.
Your token spend is an AI architecture problem, not just a model problem, Arvind Jain, June 4, 2026.
Your company needs agency, not agents, Elena Verna, June 5, 2026.
Your AI Budget Is Growing. Your Returns Aren’t. Here’s Why., Bain & Company, June 2, 2026.
When AI builds itself, Anthropic Institute, June 4, 2026.
Is tokenmaxxing cost effective?, Jellyfish, April 15, 2026.
Uber caps employee AI spending after blowing through budget in four months, TechCrunch, June 2, 2026.
AI costs how much? GitHub Copilot users react to new usage-based pricing system, Ars Technica, June 2, 2026.
Perplexity CEO tells CNBC one metric will determine who wins the AI race, CNBC, June 3, 2026.
Sam Altman says OpenAI’s top token spender uses 100 billion tokens a month, Business Insider, June 3, 2026.

AI Operating Systems by Mart Roosimägi
