The New Software SOP

May 31, 2026

I’ve restarted writing here on Substack. Different professional path now, different articles. I’ve relied heavily on Codex when writing this and I’ll continue doing so throughout the writing. The outputs reflect whatever themes are top of mind for me at any given time, and I’m heavily invested in getting a grip on how to make AI-generated output good.

This is a personal publication. Views expressed below are my own.

The awkward thing about agent-written software is that it can look ready before the team has decided whether it is worth shipping. A pull request can include code, tests, screenshots, and a product note. It can pass CI. It can still be unclear what customer behavior should change, what risk was introduced, and who will notice if the release quietly fails.

Agentic software work makes implementation less scarce, but accepted change remains scarce. The limiting work moves into context, review, evidence, cost control, ownership, and learning after release.

Teams that optimize for more tickets, demos, pull requests, or token usage will get more plausible work than they can trust. Teams that optimize the loop around change can turn agent output into product learning. The work is not a better feature factory. It is a reviewed change loop.

Code Got Cheaper; Accepted Change Did Not

AI still has the messy shape of an early platform shift. Benedict Evans compares the moment to 1997 for the internet: the technology is clearly important, many of the winning products do not exist yet, adoption is uneven, and a lot of current usage patterns will look primitive later. Operating discipline becomes more important when the product category is still unstable. Teams should expect experiments, false starts, sudden capability jumps, and strange new workloads rather than one clean replacement of the old software process.

Feature lists lose leverage when they become the main control system. They remain useful when they express choices, sequencing, dependencies, evidence needs, architecture runway, regulatory readiness, and release coordination. The weak version of a feature list is different: a queue of requests waiting for delivery capacity, with customer value assumed somewhere after shipment.

The weak version made more sense when engineering time was the scarce resource. Teams built roadmaps, sprint plans, dependency boards, portfolio reviews, and capacity rituals around the problem of allocating human implementation time. Some of those rituals still matter. Large products still have dependencies. Regulated products still have readiness constraints. Shared infrastructure still has sequencing problems. Release trains do not disappear because an agent can write a diff.

Planning itself is not the issue. Planning breaks when it treats feature execution as the scarce value-creating step.

In his conversation with Lenny Rachitsky, Benedict Evans separates the generated artifact from the surrounding job. Claude Code can make features, but the harder work is deciding what features should exist, who the customer is, what the product should do, and how the company will take it to market.

Benedict Evans, Lenny’s Podcast:

“What is the hard part? Is it the task or the job?”

A model can make the ticket move faster, but it cannot make a weak product decision strong. It can make the weak decision look more finished. The artifact arrives with screenshots, copy, tests, and a demo. The delivery system receives it as progress. The customer may still not care.

Ryan Lapopolo brings out the same constraint shift inside the codebase. In Aakash Gupta’s interview on how OpenAI PMs ship code, he says software organizations were shaped by the assumption that code was expensive to produce. When code can be generated in parallel, the scarce work moves into context, validation, architecture, review, and judgment.

Ryan Lapopolo, Aakash Gupta interview:

“The code is trivial to generate.”

Coding has not stopped mattering. The team has to stop using implementation as the main proxy for progress. A change is not valuable because it was produced. It becomes valuable when it improves product behavior, preserves system quality, and leaves enough evidence for the next person or agent to understand it.

Evans’s accounting example explains why this does not necessarily mean less work. The arrival of adding machines, punch cards, mainframes, databases, ERP systems, cloud software, spreadsheets, and PCs did not make accountants vanish; accounting employment kept growing because cheaper calculation raised the return on doing more analysis, reporting, control, and planning. Software may rhyme with that. If agents make implementation cheaper, teams may build more software, test more variants, inspect more customer behavior, and carry more responsibility for deciding which changes are worth keeping.

The Reviewed Change Loop

The reviewed change loop becomes the unit of work:

Customer signal -> proposed change -> human judgment -> tests and evals -> living docs -> risk evidence -> release -> monitoring -> next signal.

The loop is not a heavier process by default. It is a way to move evidence to the place where decisions happen. A pull request can be one container for the loop, but it is not the loop itself. A trunk-based team, a paired team, a regulated release process, or an agent-orchestrated workflow can all use the same underlying pattern.

An agent-assisted change needs an evidence pack that answers seven questions:

Step | Required evidence | Human decision
Customer signal | Source, frequency, severity, affected segment, example | Is this worth exploring?
Proposed change | Product note, expected behavior change, alternatives, deletion condition | Should this exist?
Implementation | Diff, tests, screenshots, dependency changes, architecture impact | Is it safe enough?
Risk check | Data touched, permissions, compliance trigger, rollback path | Does this need stronger approval?
Release | Flag, monitor, support note, owner, release note | Can we observe impact?
Learning | Metric movement, incidents, feedback, customer examples | Keep, adjust, or delete?
System update | PRD/doc diff, new guardrail, new test, runbook change | Did the system learn?

The table earns the SOP in the title. It does not replace judgment; it gives judgment something to inspect.
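One way to make the evidence pack inspectable is to treat it as a small data structure that travels with the change. The sketch below is illustrative only: the class name, the field layout, and the example change are assumptions, and only the seven step names come from the table above.

```python
from dataclasses import dataclass, field

# Step names follow the table above; everything else here is a hypothetical layout.
STEPS = (
    "customer_signal",
    "proposed_change",
    "implementation",
    "risk_check",
    "release",
    "learning",
    "system_update",
)


@dataclass
class EvidencePack:
    """Evidence attached to one change, keyed by loop step."""

    change_id: str
    evidence: dict[str, list[str]] = field(default_factory=dict)

    def attach(self, step: str, item: str) -> None:
        """Record one piece of evidence under a known step."""
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        self.evidence.setdefault(step, []).append(item)

    def missing_steps(self) -> list[str]:
        """Steps that have no evidence yet, in loop order."""
        return [s for s in STEPS if not self.evidence.get(s)]

    def is_complete(self) -> bool:
        return not self.missing_steps()


# Hypothetical change: three of the seven steps have evidence so far.
pack = EvidencePack("chg-001")
pack.attach("customer_signal", "12 support tickets about export timeouts")
pack.attach("proposed_change", "product note: stream exports in chunks")
pack.attach("implementation", "diff, tests, screenshots")
print(pack.missing_steps())
```

A reviewer or a review agent could call missing_steps() before the human decision, so the conversation starts from what is absent rather than from what looks finished.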

Metrics have to change too. PR count, token burn, ticket closure, and demo volume are weak measures on their own. Better measures include customer-signal-to-proposed-change time, proposed-change-to-reviewed-decision time, percentage of generated changes rejected or deleted, defect escape rate for agent-assisted changes, review load per engineer, share of changes with a complete evidence pack, rollback frequency and rollback time, cost per accepted change, outcome movement after release, and PRD or documentation freshness.

Teams do not need all of those measures on day one. They need to measure accepted change rather than generated artifacts.
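A few of those measures can be computed from a simple log of changes. The following is a minimal sketch under stated assumptions: the ChangeRecord fields and the way cost is priced are hypothetical, and charging the cost of rejected work against accepted changes is one possible design choice, not the only one.

```python
from dataclasses import dataclass


@dataclass
class ChangeRecord:
    accepted: bool           # survived review and stayed in the product
    complete_evidence: bool  # every evidence-pack step was filled in
    rolled_back: bool        # released, then reverted
    cost_usd: float          # tokens plus review time, however the team prices them


def loop_metrics(records: list[ChangeRecord]) -> dict[str, float]:
    """Summarize accepted change rather than generated artifacts."""
    total = len(records)
    if total == 0:
        raise ValueError("no changes recorded")
    accepted = [r for r in records if r.accepted]
    total_cost = sum(r.cost_usd for r in records)  # includes rejected work
    return {
        "rejection_rate": (total - len(accepted)) / total,
        "evidence_pack_share": sum(r.complete_evidence for r in records) / total,
        "rollback_rate": (
            sum(r.rolled_back for r in accepted) / len(accepted) if accepted else 0.0
        ),
        "cost_per_accepted_change": (
            total_cost / len(accepted) if accepted else float("inf")
        ),
    }


# Hypothetical week: four generated changes, three accepted, one rolled back.
records = [
    ChangeRecord(accepted=True, complete_evidence=True, rolled_back=False, cost_usd=10.0),
    ChangeRecord(accepted=True, complete_evidence=False, rolled_back=True, cost_usd=20.0),
    ChangeRecord(accepted=False, complete_evidence=False, rolled_back=False, cost_usd=30.0),
    ChangeRecord(accepted=True, complete_evidence=True, rolled_back=False, cost_usd=30.0),
]
metrics = loop_metrics(records)
print(metrics)
```

Because the rejected change still counts toward total cost, generating more plausible work that nobody accepts makes cost per accepted change go up, which is the behavior the metric is meant to expose.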

PM Work Changes

The PM shifts from writing requests to shaping reviewable change.

The PM connects a customer signal to an expected behavior change, defines what evidence would justify implementation, decides what can be tested cheaply, keeps the PRD alive as the product changes, and helps delete changes that do not move the outcome. The PM is not merely the person who writes the prompt, the ticket, or the PRD. The PM owns the product decision as it travels through the loop.

Barry O’Reilly’s Mind the Product conversation starts with behavior change. His examples center on capture, synthesis, challenge, and action: everyday work becomes reusable context, decisions are prepared better, and teams stop treating activity as the same thing as progress.

Barry O’Reilly, Mind the Product:

“If you start with tools, you’re going to fail.”

AI-assisted product work should start closer to customers than to agent capacity. A support pattern, onboarding failure, sales objection, usage trace, failed experiment, or customer call should become a product note with expected behavior change and evidence needs. The agent can help generate candidate changes, but the PM has to define what would make one worth accepting.

11FS and Backbase describe banking work as customer intent moving through systems, people, policy, and agents until the intent is resolved. Their language is customer intent, resolution loops, workflow metadata, truth layers, and governed execution. The valuable metric is not whether an AI surface responded. It is whether the resolution loop became shorter, safer, and more understandable.

Ordinary software needs the same traceability. The PM should be able to trace a customer signal into a proposed change, the relevant test or eval, the release note, the documentation update, and the monitor that will show whether the change worked. That is different from grooming a backlog. It keeps the customer inside the release loop.

OpenAI and Anthropic are leaning into consultancies, forward-deployed engineers, and private-equity implementation partners because workflow redesign still takes people. Evans points out that companies do not have idle teams waiting to reimagine workflows, connect systems, retrain people, and manage the politics of change. If AI creates more implementation capacity, it also creates more redesign work. The PM version of that work is not writing a longer requirements document. It is finding the customer loop where the new capacity should land.

Engineering Work Changes

Engineers create leverage by designing the system in which agents work: repository structure, architecture boundaries, test speed, fixtures, observability, permissions, review agents, rollback paths, and conventions that make generated changes inspectable.

Ryan Lapopolo’s OpenAI examples show what the agent needs before generated code becomes trustworthy: a legible repository, modular architecture, fakes for dependent services, fast validation, browser loops, local observability, review agents, and documentation it can consume. When Codex makes a mistake, the team does not merely correct that one output. It changes the repo, docs, tests, harness, or workflow so the same class of mistake becomes less likely.

Agent performance depends on the surrounding system, not only the model. The paper From Model Scaling to System Scaling names context, tools, retries, verification, permissions, human intervention, and auditability. Tomasz Tunguz’s agent harness framing gives a practical inventory: context and memory, tools and action, orchestration, state, sandbox, observability and governance, cost and workflow optimization.

An engineer once created leverage by writing the hard code directly. Now the leverage often comes from making the codebase a better workplace for agents and humans. Tests have to be fast enough to run often. Architecture boundaries have to reduce where code can go wrong. Logs and traces have to be readable by a person and useful to a model. Fixtures have to make behavior reproducible. Review agents need rules. Humans need proof of work.

Engineers remain central. The craft lives in the environment that produces and accepts change.

Docs, PRDs, Compliance, And Risk Move Into The Pipeline

A stale PRD is a historical artifact. A living PRD is part of the system.

When a PRD describes the product as imagined three months ago, while the code and customer behavior have moved on, it confuses the team. When it ships with the release, changes as the application changes, and records why decisions were made, it gives the team something to inspect. A PM can read it. An engineer can challenge it. An agent can use it. A future incident responder can understand why the system behaves the way it does.

Ryan Lapopolo describes product-sense docs, security expectations, QA plans, feature documentation, and guardrails that live in files the agent can read. Jason Liu’s Codex-maxxing workflow uses durable memory, review surfaces, browser feedback, and verification-based goals. Both point to the same operating habit: context should not remain trapped in a meeting, a chat thread, or a person’s head if an agent is expected to act on it later.

Risk and compliance need to move closer to the pipeline, with restraint. Compliance should not become a fantasy of total automation. 11FS/Backbase discuss truth layers, evidence ledgers, agent identity, entitlements, policies, pre-checks, post-checks, and observability. Shub Agarwal, in Your AI Pilot is Lying to You, argues for production paths, evals as acceptance criteria, and trust as a first-order feature. Aaron Levie points to access controls as one of the main reasons agentic work outside coding is harder than coding.

Shub Agarwal, The Data and AI Chief:

“Iteration is the product.”

Risk evidence should iterate with the product too. A change touching payments, permissions, customer data, pricing, or regulated behavior should not depend on someone remembering to involve risk at the end. The policy check, approval gate, audit trail, entitlement diff, data lineage note, and rollback path should be visible in the change itself.
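As a rough illustration, a pipeline step could refuse to pass a sensitive change until the risk evidence is attached to it. The area names and evidence names below mirror the paragraph above; the function name and the idea of representing them as plain strings are assumptions made for the sketch.

```python
# Areas and evidence items are taken from the paragraph above; the identifiers are hypothetical.
SENSITIVE_AREAS = {
    "payments",
    "permissions",
    "customer_data",
    "pricing",
    "regulated_behavior",
}

REQUIRED_RISK_EVIDENCE = (
    "policy_check",
    "approval_gate",
    "audit_trail",
    "entitlement_diff",
    "data_lineage_note",
    "rollback_path",
)


def risk_gaps(areas_touched: set[str], evidence_present: set[str]) -> list[str]:
    """Return the risk evidence still missing from a change.

    A change that touches no sensitive area has no gaps by definition.
    """
    if not (areas_touched & SENSITIVE_AREAS):
        return []
    return [e for e in REQUIRED_RISK_EVIDENCE if e not in evidence_present]


# A copy tweak passes; a payments change with partial evidence does not.
print(risk_gaps({"ui_copy"}, set()))
print(risk_gaps({"payments"}, {"policy_check", "rollback_path"}))
```

In practice a team would likely scale the required evidence to the area touched rather than demand all six items every time. The point of the sketch is only that the gap is computed from the change itself instead of from someone’s memory.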

Trust attaches less to a role label and more to an evidence path: customer signal, product decision, data used, code changed, tests run, policy checks passed, release observed, and user outcome after release. A trust score may help in some settings, but the score is not the operating model. The evidence path is.

Human-Readable Code Becomes Disaster Recovery

Agent-written code still has to be readable on the bad day.

The bad day might be an outage. It might be an exhausted credit limit. It might be a vendor policy change. It might be a security decision that blocks a tool. It might be a production incident at the exact moment the frontier coding agent is unavailable. At that point, the codebase cannot require the original agent, the original prompt thread, or the original model provider to understand what happened.

The Verge reported that Microsoft planned to remove most Claude Code licenses from one large internal group and move developers toward Copilot CLI, with cost and product-control considerations in the mix. Axios described broader AI sticker shock as companies move from experimentation into material spend. Aaron Levie says token budgeting is now one of the hottest enterprise issues. Lars Faye adds the engineering warning in Agentic Coding is a Trap: vendor lock-in, outages, cost swings, and skill atrophy become real operating risks.

Lars Faye, Agentic Coding is a Trap:

“An employee’s cost is fixed; tokens are a constantly moving target.”

Rejecting coding agents does not solve this. Recoverability has to become part of engineering quality. A human should be able to read the code. A smaller local model should be able to inspect the architecture, run the tests, and propose a narrow patch. The repository should have runbooks, fixtures, architecture notes, traces, and validation commands that survive outside the frontier-agent session.

Call the pattern agent dependency disaster recovery, or manual/local-LLM development fallback. The team keeps enough structure in the system that production work can continue in degraded-agent mode.
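Degraded-agent mode can be written down as an explicit fallback order, so the team decides it before the bad day rather than during it. This is a minimal sketch: the mode names, the inputs, and the ordering (frontier agent, then local model, then manual work) are assumptions for illustration.

```python
from enum import Enum


class Mode(Enum):
    FRONTIER_AGENT = "frontier_agent"
    LOCAL_MODEL = "local_model"
    MANUAL = "manual"


def pick_mode(
    frontier_available: bool,
    budget_remaining_usd: float,
    local_model_available: bool,
) -> Mode:
    """Choose how development continues today, preferring the strongest option that works."""
    if frontier_available and budget_remaining_usd > 0:
        return Mode.FRONTIER_AGENT
    if local_model_available:
        return Mode.LOCAL_MODEL
    # Nothing automated is usable: humans work from runbooks, tests, and readable code.
    return Mode.MANUAL


# Provider is up, but the token budget is spent: fall back to the local model.
print(pick_mode(frontier_available=True, budget_remaining_usd=0.0, local_model_available=True))
```

The selector itself is trivial. What it forces is the harder question: whether the repository’s runbooks, fixtures, and validation commands are good enough for the second and third modes to be usable at all.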

George Hotz warns that agents can frontload progress and leave polish, correctness, and detectability problems behind. Large organizations are especially exposed because slower feedback loops and weaker self-checks let plausible output accumulate into a gradual decline in average quality.

The SOP should assume this risk. If agents make more code, the team needs stronger architecture boundaries. If agents make more tests, the team needs to know which tests prove product behavior. If agents make more docs, the team needs reviewable diffs. If agents make more features, the team needs a faster way to decide which features create durable value for users.

The PR Is Only One Container

The reviewed change loop should not turn into worship of the pull request.

Andrea Laforgia’s Stop Using Pull Requests attacks the ritual without attacking quality. He argues that PRs were designed for low-trust open-source contribution and can become slow inspection queues inside private teams. His alternative is TDD, trunk-based development, and team-focused development: build quality earlier, integrate continuously, and review during creation.

Andrea Laforgia, Stop Using Pull Requests:

“The problem is not review itself, but the speed of review.”

A pull request can be a good container for evidence: context, diff, tests, screenshots, eval traces, docs, policy checks, and release notes. But if the PR becomes a waiting room for plausible machine output, the team has recreated the old bottleneck with higher volume.

The rule is smaller and stricter. Every change should be reviewable, tested, owned, and connected to the signal that caused it. Some teams will do that through PRs. Some will do it through trunk-based changes, feature flags, pairing, review agents, post-commit checks, or release gates. The specific ritual should follow the system’s risk, team trust, and verification strength.

The New SOP

The old feature factory had a comforting shape. A request entered. A team delivered. The process could be measured even when the product outcome was vague.

Agentic software work breaks that comfort. It can generate too much plausible work. It can make weak ideas look finished. It can create cost surprises. It can hide complexity behind a clean diff. It can also make teams faster, more experimental, and more closely connected to users when the operating model is built for it.

The new SOP is the reviewed change loop. A customer signal arrives. The team turns it into a proposed change. A human reviews the product decision and the evidence. Tests, evals, policy checks, and docs move with the code. The release is monitored. Incidents and failed tests become candidate patches. The PRD changes when the product changes. The next signal enters a system that now knows more than it did last time.

The loop is still work. It may be more work at first. The focus shifts from managing delivery resources toward assessing outcome value, preserving system quality, and making trust visible. The team spends less time waiting for someone to type code and more time deciding what deserves to become durable product behavior.

The generated change from the opening is not complete because an agent wrote code, tests, and a product note. It becomes complete only when the loop around it holds: user signal, architecture, review, validation, docs, risk, release, monitoring, recovery, and the next change.

Teams that build that loop will turn agent output into product learning. Teams that keep the feature factory will get more work than they can trust.

References

I got an inside look at how OpenAI PMs ship code, Aakash Gupta and Ryan Lapopolo, May 25, 2026.
A rational conversation on where AI is actually going, Benedict Evans / Lenny’s Podcast, May 31, 2026.
State of Enterprise AI 2026, Aaron Levie / The MAD Podcast, May 28, 2026.
Why your AI strategy is failing, Barry O’Reilly / Mind the Product, May 27, 2026.
Your AI Pilot is Lying to You, Shub Agarwal / The Data and AI Chief, May 27, 2026.
11FS, Backbase: Can traditional banking survive the AI era?, 11FS / Backbase, May 14, 2026.
From Model Scaling to System Scaling, May 25, 2026.
Software After AI - AI Agent Harness, Tomasz Tunguz, May 27, 2026.
Codex-maxxing, Jason Liu, May 10, 2026.
Claude Code’s creator on the end of the software engineer, Casey Newton, May 26, 2026.
The Eternal Sloptember, George Hotz, May 24, 2026.
Agentic Coding is a Trap, Lars Faye, April 13, 2026.
Stop Using Pull Requests, Andrea Laforgia, March 19, 2026.
Microsoft starts canceling Claude Code licenses, The Verge, May 14, 2026.
AI sticker shock hits corporate America, Axios, May 28, 2026.

AI Operating Systems by Mart Roosimägi
