When AI Makes Building Cheaper, Product and Engineering Roles Must Move

May 24, 2026

I’m restarting my writing here on Substack. Different professional path now, different articles. I’ve relied heavily on Codex in writing this, and I’ll continue to do so in future posts. This is more about curation and building systems than about typing the output. The outputs reflect the themes that are top of mind for me at any given time, and I’m heavily invested in getting a grip on how to get AI-generated output to good quality.

This is a personal publication. Views expressed below are my own.

AI coding tools can now produce work that looks ready before anyone has really lived with it: a pull request with tests, a prototype with polish, a plan with confident trade-offs. The review queue is where that work becomes product or becomes residue.

The source base splits into three clusters: Simon Willison on agentic engineering and vibe coding, together with Dan Shipper on automation creating more human work and more independent PM and design work; operating-model pieces on ADLC, harnesses, and inspectable artifacts; and risk pieces on overread AI confidence and tokenmaxxing.

As teams move up the usage curve, the focus shifts from first-draft production to absorption: deciding what is worth trusting, integrating, operating, maintaining, or deleting. That changes both roles. Engineers move toward technical coherence, contribution pathways, tests, observability, and safe operating models. Product managers take more responsibility for making ideas concrete, evidence-backed, and reviewable.

Generated Work Creates an Absorption Problem

Simon Willison’s notes from his Lenny’s Podcast conversation about agentic engineering give a useful starting point. Code is a bellwether for the rest of knowledge work because code is more checkable than many other outputs. It either runs or it does not, at least in the simple case. Yet even in code, the question has changed.

Simon Willison, “Highlights from my conversation about agentic engineering on Lenny’s Podcast”:

“How do we get from most of it works to all of it works?”

As teams move up the usage curve, the focus shifts from whether a model can produce something useful to how a team gets from mostly working output to something it can trust, own, and operate.

That shift appears quickly in product work. Willison describes being able to prototype several versions of a UI because the cost of a prototype has fallen so far. This is a real gain. Teams can explore more possibilities before committing. But the next question becomes harder: if three plausible versions appear instead of one, who decides which one deserves user testing, which one fits the product, which one is technically sustainable, and which one should be deleted?

Dan Shipper frames the same mechanism less defensively in “After Automation” and in his Lenny’s Podcast conversation. In the episode, he compresses the paradox into two sentences.

Dan Shipper, “The AI paradox: More automation, more humans, more work”:

“Automation is a lie. Every agent needs a human.”

At Every, AI is used across coding, writing, design, customer service, and operations. The company has become more AI-forward while still needing more people to shape, supervise, review, and integrate the work.

AI reduces the cost of attempts. It does not remove the cost of judgment.

This is the absorption problem. A pull request is not delivered value. A prototype is not product learning. A generated product plan is not a decision. A passing test suite is not proof that the thing belongs in the system. Each artifact still has to be understood, reviewed, routed, tested, accepted, changed, or rejected.

Some attempts are valuable. Some are junk. Many are plausible enough to demand attention. The old system was designed for lower output volume and clearer authorship. The new system needs better ways to inspect, route, test, decline, and absorb work without turning review into a swamp.

Productivity becomes slippery here. If one person creates five artifacts where they used to create one, the local metric looks good. But if four of those artifacts create review burden for other people, the system may not have become faster. It may have moved the cost into a quieter queue.
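The arithmetic can be made concrete with a small Python sketch. Every number and name here (`Period`, the hours per artifact) is an illustrative assumption, not a measurement from any of the sources; the point is only that a local count of artifacts and a system-wide cost per absorbed artifact can move in opposite directions.

```python
from dataclasses import dataclass


@dataclass
class Period:
    """One person's output over a period, with hypothetical effort figures."""
    artifacts_created: int
    artifacts_absorbed: int      # artifacts the organization actually accepted
    author_hours_each: float     # assumed time the author spends per artifact
    review_hours_each: float     # assumed time others spend reviewing each one

    def system_hours_per_absorbed(self) -> float:
        """Total author plus reviewer hours, divided by absorbed artifacts."""
        if self.artifacts_absorbed == 0:
            return float("inf")
        total = self.artifacts_created * (
            self.author_hours_each + self.review_hours_each
        )
        return total / self.artifacts_absorbed


# Before AI: one artifact, slow to write, quick to review, and absorbed.
before = Period(artifacts_created=1, artifacts_absorbed=1,
                author_hours_each=8.0, review_hours_each=1.0)

# After AI: five artifacts, fast to write, but only one is absorbed.
after = Period(artifacts_created=5, artifacts_absorbed=1,
               author_hours_each=2.0, review_hours_each=1.5)

print(before.system_hours_per_absorbed())  # 9.0
print(after.system_hours_per_absorbed())   # 17.5
```

Under these assumed figures the author's output rises fivefold while the cost of each absorbed artifact nearly doubles, most of it landing on reviewers.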

Scarcity Moves From Production to Absorption

The old product-engineering operating model assumed that production capacity was scarce.

That assumption shaped the rituals: roadmaps, backlogs, prioritization forums, sprint planning, acceptance criteria, delivery tracking, and escalation when capacity was not enough. These practices still matter. Large organizations still need coordination. Scarce engineering time has not disappeared.

But AI changes the relative scarcity.

More people can now produce something concrete before a formal build process begins. A product manager can create a working prototype. A designer can test an interaction. An analyst can build a rough workflow. An operator can generate a small internal tool. An engineer can run several agents in parallel. A team can produce more options, more variants, and more partial implementations than before.

The constraint moves downstream.

Absorption includes review capacity, technical judgment, product judgment, compliance judgment, user validation, maintainability, ownership, monitoring, and deletion. A generated artifact has little value until the organization can decide what to do with it.

The CTO Playbook episode on moving from SDLC to ADLC gives this an operating-model frame. The useful warning is that faster generation exposes the rest of the delivery system.

Kyle Horner, “CTO Playbook: From SDLC to ADLC”:

“If we tripled our code output tomorrow, what would break first?”

Testing, review standards, CI/CD, documentation, release discipline, and shared working methods decide whether speed can be used.

Sprints and SAFe-style coordination start to look less central under that pressure. They were built for a world where scarce delivery capacity had to be allocated carefully across teams. That constraint still exists. Higher on the AI usage curve, another constraint starts to dominate: value assessment. More people and agents can produce work, so the harder organizational question becomes which attempts deserve attention, which belong together, which should be killed early, and which have actually improved the user’s life after release.

A sprint can still create cadence. A portfolio process can still prevent chaos. But cadence and portfolio hygiene do not answer the new question by themselves. When the system can generate more options than the organization can absorb, the useful operating model has to put more weight on outcome review, evidence quality, user learning, reversibility, and deletion.

The old rituals can survive only if they become better at judging value, not only better at scheduling delivery.

Engineers Need to Let Go of Some First-Draft Control

Engineers remain central in this world. Their leverage changes.

If AI lets product managers, designers, analysts, and operators create useful technical artifacts, engineering cannot respond only by defending the old boundary around production. That boundary will become harder to maintain and less useful. The stronger engineering response is to design safer contribution pathways.

This means engineers move from being the only builders to being stewards of technical coherence.

They define what safe contribution looks like. They decide which artifacts can remain local and which need formal review. They create tests, templates, traces, dependency checks, rollback paths, and review standards. They shape how agents can touch repositories, how generated code is evaluated, and what evidence must travel with a proposed change.

Willison’s later reflection, “Vibe coding and agentic engineering are getting closer than I’d like”, captures the discomfort. As coding agents become more reliable, it becomes rational to inspect less of what they produce in some situations. That is valuable and dangerous at the same time. The old review habit assumed a human author had been close to the work. That assumption weakens when, as Willison puts it, “Claude Code does not have a professional reputation.”

The harness around the model matters here. Harness Engineering 101 uses “harness” for the environment around the model: tools, context, execution, memory, guardrails, evaluation, observability, and orchestration. The useful point is that model output becomes reviewable only when the surrounding tooling exposes enough context for humans to judge it.

If the harness produces logs, traces, tests, diffs, context summaries, and failure reports, review becomes easier. If the harness only returns a polished artifact, reviewers have to reconstruct the process after the fact.

So the engineering question becomes less “Did I personally write or inspect every line?” and more “What evidence tells us this change is safe, coherent, reversible, and worth owning?” Willison’s own test is blunt: “I want somebody to have used the thing.”

That is a higher-leverage engineering problem.

It also requires letting go. Safety cannot come only from restricting who can produce artifacts. In an AI-assisted organization, safety must also come from making artifacts inspectable, testable, reversible, and properly owned.

Engineers should not become the cleanup function for AI output. They should build the system that lets more good work enter without lowering the bar.

PMs Need to Take On More Before Handoff

The product management role also has to move.

If product managers can create prototypes, pressure-test assumptions, inspect data, draft workflows, generate PRD variants, and sometimes produce rough technical artifacts, then the old request-and-prioritize role becomes too thin.

AI removes some of the excuse that product intent can stay vague until engineering translates it.

A PM-created artifact should not arrive as “AI made this, can you check it?” It should arrive with enough structure for engineering, design, risk, compliance, or business stakeholders to know what kind of attention it needs.

A good AI-assisted PM artifact should make several things visible:

the user problem;
the decision needed;
the intended use of the artifact;
the assumptions behind it;
the sources used;
the alternatives considered;
the known gaps;
the parts that are disposable;
the parts that could affect durable systems;
the specific questions that need engineering judgment.
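One way to make that checklist hard to skip is to give the artifact a structured cover sheet. The following is a minimal Python sketch, assuming a hypothetical `PMArtifact` record whose field names simply mirror the list above; it is not taken from any of the cited sources.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PMArtifact:
    """Hypothetical cover sheet that travels with a PM-created artifact."""
    user_problem: str = ""
    decision_needed: str = ""
    intended_use: str = ""
    assumptions: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)
    alternatives_considered: list[str] = field(default_factory=list)
    known_gaps: list[str] = field(default_factory=list)
    disposable_parts: list[str] = field(default_factory=list)
    durable_system_impacts: list[str] = field(default_factory=list)
    engineering_questions: list[str] = field(default_factory=list)


def missing_fields(artifact: PMArtifact) -> list[str]:
    """Return the names of fields left empty, in checklist order."""
    return [f.name for f in fields(artifact) if not getattr(artifact, f.name)]


draft = PMArtifact(
    user_problem="Support agents re-type order details into two tools.",
    decision_needed="Should we test a prefilled form with five agents?",
)
print(missing_fields(draft))
```

A field can legitimately be empty, for example when nothing touches durable systems. In that case this sketch would want an explicit entry such as "none identified", so that the reviewer can tell a considered answer from an unanswered question.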

Shipper’s optimism about PMs helps explain the upside. In the Lenny’s Podcast episode, one of his explicit predictions is that PMs will thrive in the AI era. The reason is not that PMs suddenly become full-time engineers. The reason is that PMs with product judgment, customer context, and enough technical fluency can carry ideas further before asking the rest of the system to react.

That creates more independence, but also more accountability.

The PM’s job moves from writing a request to shaping a reviewable artifact. The artifact might be a prototype, an internal tool, a workflow, an analysis, a customer journey, a product plan, or a structured decision memo. The important point is that it should reduce ambiguity for the next person, not transfer ambiguity to them.

A PM who uses AI well does not simply produce more. They produce work that is easier to judge.

Review Is Where AI Work Becomes Real

Review catches bad AI output. That is only the defensive version.

The more constructive version is that review gives newly capable people a way to meet the shared system. A PM-built prototype can become a serious product discussion. A designer’s generated interaction can become a testable product direction. An operator’s rough automation can become a managed internal tool. A support analysis can become product evidence. An agent-generated bug report can give engineering better reproduction steps than a human would have written.

Review is where local work either joins the durable system or remains a useful local experiment.

This requires better evidence than “the output looks complete.”

For code, evidence might include tests, dependency diffs, screenshots, traces, reproduction steps, threat-model notes, performance checks, or a clear statement of what the human owner actually verified. For product artifacts, evidence might include sources, assumptions, discarded alternatives, user observations, known uncertainties, and what would change the recommendation. For agent workflows, evidence might include permissions, logs, escalation paths, failure modes, and rollback options.

The review surface should fit the risk and durability of the artifact.

A private prototype can stay lightweight. A throwaway script can have a lower bar. A customer-facing workflow, regulated decision process, security-sensitive tool, or durable system change needs stronger evidence and clearer ownership.

The practical rule is simple: where the cost of being wrong is low, move quickly. Where the cost is high, make the evidence travel with the artifact.
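That rule can be sketched as a small routing function in Python. The two tiers, the flag names, and the evidence lists are hypothetical choices made for illustration; a real team would pick its own categories and thresholds.

```python
from enum import Enum


class ReviewTier(Enum):
    LIGHTWEIGHT = "lightweight"  # private prototypes, throwaway scripts
    STRONG = "strong"            # anything where being wrong is expensive


# Assumed evidence requirements per tier; the exact items are illustrative.
REQUIRED_EVIDENCE = {
    ReviewTier.LIGHTWEIGHT: {"stated_purpose"},
    ReviewTier.STRONG: {
        "stated_purpose",
        "tests",
        "named_owner",
        "rollback_path",
        "human_verified_summary",
    },
}


def review_tier(*, customer_facing: bool = False, regulated: bool = False,
                security_sensitive: bool = False,
                durable_change: bool = False) -> ReviewTier:
    """Pick the stronger tier if any high-cost property applies."""
    high_cost = (customer_facing or regulated
                 or security_sensitive or durable_change)
    return ReviewTier.STRONG if high_cost else ReviewTier.LIGHTWEIGHT


def evidence_gaps(tier: ReviewTier, provided: set[str]) -> list[str]:
    """List required evidence that did not travel with the artifact."""
    return sorted(REQUIRED_EVIDENCE[tier] - set(provided))


tier = review_tier(customer_facing=True)
print(tier.value)                                        # strong
print(evidence_gaps(tier, {"stated_purpose", "tests"}))
# ['human_verified_summary', 'named_owner', 'rollback_path']
```

The design choice worth noting is that the function reports what is missing rather than simply blocking the artifact, which keeps the low-risk path fast and makes the high-risk path explicit.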

Artifacts Decide Whether Review Is Possible

Review is not an abstract virtue. It is a behavior that has to fit human attention.

That is why artifact format matters. In the ChatPRD interview with Thariq Shihipar, “Replacing Markdown with HTML for AI-Powered Development”, long Markdown plans become walls of text. People stop reading them, or they ask the model to edit the plan instead of engaging with it. HTML turns a generated plan into something visual, navigable, and interactive: mockups, risk assessments, file structures, code snippets, mood boards, decision rules, and editing interfaces.

That is a product-management point, not only a tooling point.

The format of the artifact shapes the quality of review. If the artifact hides important decisions in a long document, the team will miss them. If the artifact shows assumptions, options, risks, sources, and decision points clearly, the reviewer can engage.

This is a new area of PM craft. Product managers need to think about the internal surfaces through which AI-assisted work becomes legible. What should the reviewer see first? What should be editable? Where should uncertainty appear? Which source links must be visible? What should be a prototype, what should be a table, what should be a diff, and what should be deleted after it has served its purpose? In the ChatPRD interview, Thariq Shihipar calls part of this role being a “compute allocator.”

Engineers face the same question from the system side. If agents can generate more code than the team can read line by line, engineering leverage comes from the harness around the model: tests, logs, traces, permissions, context, guardrails, evaluation, and deployment controls.

Speed that cannot be inspected becomes mystery.

Bad Signals Can Flood the Queue

Two failure modes can make AI-assisted work look better than it is.

The first is misplaced confidence. The Communications Psychology research summarized by Phys.org describes an “illusion of confidence”: people inferred AI confidence from indirect cues even when the system did not explicitly communicate confidence.

That study is not about code review. It should not be overused. But it supports a practical product-design warning: if an interface does not show uncertainty, evidence, or scope limits, users will infer confidence from other cues.

In review, this matters. A polished pull request can hide weak assumptions. A neat summary can feel more reliable than the messy sources behind it. A generated plan can look more settled than it is. The reviewer is not only reviewing content. They are reviewing a confidence display, whether or not the product intentionally designed one.

The second failure mode is bad measurement. The Pragmatic Engineer’s piece on tokenmaxxing describes what happens when AI usage becomes a status signal: “great for AI vendors, bad for everyone else.”

The analogy to lines of code is obvious. If a team rewards volume, people can produce volume. If it rewards token use, people can burn tokens. If it rewards AI-generated pull requests, people can create review burden faster than the organization can absorb it.

Usage metrics can still help. They can show adoption, reveal bottlenecks, and identify runaway workflows. But they should remain diagnostic. Once they become status markers, they distort behavior.

The healthier metric is not how much AI output a team creates. It is how much useful work the team can safely absorb.

Trust Moves Toward Durable Value

The old model of review smuggled in a lot of trust through authorship. This engineer knows the codebase. This PM understands the customer. This designer catches the rough edges. The artifact carried some of the person’s reputation with it.

That trust still matters. It just no longer travels cleanly with the work.

A generated pull request may come from a careful engineer using an agent lightly, a PM using a coding tool for the first time, or an autonomous workflow that touched files nobody read line by line. A generated product plan may reflect deep customer knowledge, or it may reflect a plausible synthesis of weak notes. A confident implementation plan may be useful, or it may be a beautiful way to hide uncertainty.

The answer is not to invent a fake all-purpose agent trust score. The answer is to move trust closer to evidence and outcome.

What evidence says this change creates durable value for users? Can it be operated safely? Can it be changed again later? Did the user’s job get easier? Did the support burden fall? Did the workflow become more understandable? Did the system preserve enough evidence that the next person can improve it?

The code matters because it carries that value. The product matters because it shapes the user’s world. But the thing worth trusting is the whole path from intention to durable use.

The New Product-Engineering Contract

The change does not require a new committee for AI review. That would likely create another bottleneck.

It requires a clearer contract between product and engineering.

Product managers should bring more concrete, evidence-backed, reviewable artifacts. Engineers should create the standards and systems that make those artifacts safe to evaluate and absorb.

For product managers, that means taking more responsibility before handoff:

make the user problem clear;
show the evidence;
link the sources;
state the assumptions;
separate exploration from recommendation;
identify what needs engineering judgment;
make the artifact easy to reject, revise, or test.

Engineering takes more responsibility for the contribution system:

define what evidence belongs with different types of generated work;
create safe pathways for non-engineers to prototype;
make generated work inspectable;
automate checks where possible;
preserve technical coherence;
decide where human review must remain strong;
make rollback and ownership clear.

This is the role shift. Engineers need to let go of some first-draft control. PMs need to take on more of the path from idea to evidence.

The boundary between the roles does not disappear. It becomes more explicit. PMs should not pretend that AI-generated technical artifacts are ready for engineering absorption just because they look polished. Engineers should not treat every PM-created artifact as noise because it came from outside the traditional development path.

Both roles need to meet at the review surface.

The Queue Will Teach the Organization

AI adoption is running ahead of the operating model around it.

Teams can now produce drafts, code changes, plans, interfaces, reports, and prototypes faster than old review habits can absorb. That strain is easy to misread. The model looks unreliable. The team looks slow. The process looks bureaucratic. The missing piece is often the surface between generated work and owned work.

The review queue makes that surface visible.

It contains the pull requests that look finished, the prototypes that need user evidence, the plans nobody reads, the summaries that hide uncertainty, the token dashboards that reward activity, and the artifacts that either help humans inspect the work or quietly push them out of the loop.

It also contains the better future: the PM prototype that should become a feature, the designer’s interaction that no longer dies in handoff, the support workflow that becomes an internal tool, the agent bug report with excellent reproduction steps, and the engineering harness that lets more people contribute without lowering standards.

Teams that design this surface will get more usable value from AI. They will know which artifacts need richer context, which outputs can stay disposable, which metrics distort behavior, and which generated work is not ready no matter how polished it looks.

The old version of AI adoption was a person asking a chatbot or coding agent for help and pasting the result into an existing process. The next version is more important. The process itself has to learn what evidence, ownership, and review look like when more people can build, more agents can act, and work arrives already dressed as finished.

The prize is not more code, more prototypes, more AI usage, or more convincing demos.

The prize is a better way to find durable value for users when the cost of making another thing keeps falling.

Resources

Highlights from my conversation about agentic engineering on Lenny’s Podcast, Simon Willison, April 3, 2026. Used for the framing that software engineering is a bellwether for other knowledge work, and for the shift from generation to testing, evaluation, and lived use.
Vibe coding and agentic engineering are getting closer than I’d like, Simon Willison, May 6, 2026. Used for the section on engineers inspecting less as agent output improves, and for the distinction between generated code that looks complete and work that has been properly owned.
After Automation, Dan Shipper, Every, May 21, 2026. Used for the argument that automation can create more human work, that agents still need human framing and judgment, and that work changes shape rather than simply disappearing.
The AI paradox: More automation, more humans, more work, Dan Shipper on Lenny’s Podcast, May 24, 2026. Used for the role-change framing around PMs, designers, humans and agents working together, and work increasingly happening inside tools such as Codex or Claude Code.
CTO Playbook: From SDLC to ADLC, The CTO Playbook, April 28, 2026. Used for the operating-model point that AI-assisted delivery needs review, testing, release discipline, and workflow changes around generation.
How I AI: Thariq Shihipar on Replacing Markdown with HTML for AI-Powered Development, ChatPRD Blog, May 18, 2026. Used for the section on artifact design, reviewability, and why visual or interactive artifacts can be better review surfaces than long generated documents.
People overestimate how confident AI systems are in their responses, Phys.org / Communications Psychology, May 17, 2026. Used narrowly for the warning that users infer confidence from indirect cues when systems do not make uncertainty visible.
The Pulse: “Tokenmaxxing” as a weird new trend, The Pragmatic Engineer, April 23, 2026. Used for the warning that AI usage metrics can become status games if leaders reward volume rather than useful absorbed work.
Harness Engineering 101, AI Daily Brief, April 13, 2026. Used for the term “harness” and for the point that model output becomes reviewable only when the surrounding tooling exposes context, traces, tests, logs, and failure modes.

AI Operating Systems by Mart Roosimägi
