How an AI-Native Bank Gets Built

Jun 14, 2026

I’ve restarted writing here on Substack. Different professional path now, different articles. I’ve relied heavily on Codex when writing this, and I’ll continue to do so. The outputs reflect the themes that are top of mind for me at any given time, and I’m heavily invested in getting a grip on what makes AI-generated output good.

This is a personal publication. Views expressed below are my own.

In late 2023, UK-based Allica Bank started working on agents for complex credit analysis.

The first results came quickly. The models could reach 85 to 90% accuracy on tasks that involved financial documents and unstructured loan applications. That was enough for a convincing demonstration. It was nowhere near enough for a bank.

The following year was frustrating. Allica writes in its account of scaling AI inside the bank that moving from 85 or 90% accuracy into the high 90s required an order of magnitude more work. The team tested different models, split prompts into modules, required structured outputs, compared results with ground truth, and used independent models to validate one another. When they disagreed, the case went to a person.
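A minimal sketch of the dual-model check described above, written as an illustration rather than Allica's implementation. The `Extractor` type, the `Outcome` fields, and the rule that any single differing field triggers referral are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical type: an extractor maps document text to a dict of fields.
Extractor = Callable[[str], dict]


@dataclass
class Outcome:
    status: str              # "agreed" or "referred"
    fields: Optional[dict]   # populated only when both models agree
    disagreements: list      # names of the fields where the models differed


def cross_validate(document: str, primary: Extractor, checker: Extractor) -> Outcome:
    """Run two independent extractors and refer the case on any disagreement."""
    a = primary(document)
    b = checker(document)
    # Compare over the union of keys so a field missing from one side counts.
    differing = sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
    if differing:
        return Outcome("referred", None, differing)
    return Outcome("agreed", a, [])


# Usage with stub extractors standing in for two real models.
model_a = lambda doc: {"turnover": 100, "director": "J Smith"}
model_b = lambda doc: {"turnover": 120, "director": "J Smith"}
print(cross_validate("application text", model_a, model_b))
```

The design choice is that disagreement never gets averaged or resolved by a third model call. It becomes an explicit status that routes the case to a person.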

By mid-2025, the bank had agents running in production. One of them handles loan applications that arrive as unstructured email. It checks whether the documents are complete, asks the broker for missing information, extracts and analyses the material, calls Allica’s decisioning engine, and returns the decision by email.

During its initial production period, the system decided 50% of received cases end to end. The mean response time was 12 minutes. The fastest took seven.

The model is the least interesting part of that story.

The system had to know which documents to trust, which fields were required, which credit policy applied, which action it could take, and when uncertainty required referral. It needed security review, model-risk review, ground-truth validation, shadow deployment, drift detection, production monitoring, and an evidence trail. It also had to fit the broker’s existing workflow. An accurate agent sitting outside the lending process would have delivered, in Allica’s words, close to zero value.

Simon Taylor’s AI Operating Model Playbook uses Allica as one of its main examples of a company moving beyond personal AI adoption. The distinction is easy to miss. A bank can buy enterprise licences, launch copilots, and report high usage while the organization around the tools remains unchanged. Allica changed the organization around the work.

That raises a harder question than adoption: who has to build the rest of the bank?

The Useful Ideas Did Not Start In The Boardroom

Allica did not begin its company-wide adoption by choosing one tool and issuing a mandate. It licensed a wide set of models and products, connected them to systems such as Jira, Figma, Confluence, HubSpot, and Miro, and used peer champions inside functions to spread working practices.

The bank also ran a monthly competition for employee-built use cases. The prizes were Amazon vouchers. The results included an agent for extracting structured feedback from customer calls and email, an assistant trained on Allica’s writing style, and an agent for investigating the root cause of production bugs.

Allica Bank:

“Bottom-up use cases beat top-down ones.”

These were not applications a central strategy team had specified. They came from people who already knew where work repeated, which information was hard to assemble, and which delay irritated customers or colleagues.

The same pattern appears at much greater volume in Taylor’s account of Ramp in The AI Operating Model Playbook 2026. Ramp has built a central harness called Glass, connected it to around 30 workplace tools, and opened the resulting capability to employees. Taylor reports that more than 800 people built over 1,500 internal applications in six weeks.

The examples are small and specific. A finance employee built a contract reviewer that reportedly saves 45 minutes per contract. A sales operations employee replaced a spreadsheet-based compensation process across three organizations in 48 hours. An L&D employee created a training simulator in 15 minutes.

A central AI team could have built each of those applications. It is difficult to see how it could have identified, prioritized, and delivered all of them at the pace of 800 people working inside their own problems.

This is why the modern AI operating model has to be decentralized. The scale and variety of the useful work sit at the edges of the organization. Lending knows lending. Fraud investigators know where cases stall. Customer service sees which exceptions force a second call. Finance knows which contract terms still require an analyst to open three systems and a spreadsheet.

Central planning turns those details into requirements. Local teams begin with the details.

Building costs change the trade-off. When an internal application required a formal project, a dedicated engineering team, and months of implementation, duplicate development was expensive. AI lets a team test a narrow application before a central programme has finished gathering requirements.

Some teams will build similar things. Some experiments will fail. That waste is real. Allica warns that organic adoption can produce overlapping solutions, higher costs, and security risks.

Waiting is also waste. It appears as queue time, coordination, generalized requirements, and a large solution delivered after the local need has moved. The choice is not between duplication and efficiency. It is between different kinds of waste.

Allica’s answer is revealing: unify the bottom-up work through a coherent governed platform. Do not move all the building back to the centre.

Allica Changed More Than Its Tooling

The lending agent did not emerge from a separate AI lab while the rest of product and engineering continued as before.

Allica rebuilt its design system around components that coding agents handled well. The new system, called Alchemy, reportedly reduced page build times by 89%. Separately, the bank created a shared GitHub library where prompts and agent instructions could become versioned skills that other teams could improve.

The bank also changed team shape. Backend, frontend, and QA work moved into broader full-stack roles. Designers began changing interface code, raising pull requests, and running tests. Typical squad size fell by roughly 25%, with some “squadlets” containing one product representative and one full-stack engineer.

Smaller teams did not mean looser engineering. Staff engineers encoded their judgment into test contracts, lint rules, shared skills, security scans, and CI/CD gates. Allica describes the surrounding harness as the condition that makes agent-written code and contributions from non-engineers safe enough for a regulated institution.

The reported output increased sharply. Quarterly merged pull requests rose from about 1,900 in the first quarter of 2025 to about 7,150 a year later. Across 2025, the bank says its internal teams shipped more than 3,700 production releases while maintaining platform uptime above 99.5%.

Those numbers do not establish customer value on their own. Allica acknowledges the problem and says it wants to move toward measuring positive product increments rather than only merged pull requests. The lending workflow is stronger evidence because the result reaches a customer-visible outcome: an application that used to take days or weeks can return a decision in minutes.

This is what makes the case more than an AI productivity story. Team roles, shared infrastructure, development standards, model-risk work, and a customer workflow changed together.

Azeem Azhar explains why that combination matters in Why AI Isn’t Showing Up on Your Bottom Line. A fast team creates little enterprise value when its output waits at the next handoff. His example is an equity analyst who can update a price target continuously while compliance, publishing, and customers still operate at the old speed.

The local gain becomes congestion.

Allica shortened the whole loop. The email, documents, analysis, credit policy, decision, evidence, and broker response became one production path. The bank did not add a copilot beside the old queue.

The Centre Still Has A Product To Build

Decentralization can easily become a euphemism for shifting work and risk onto domain teams. Giving every function an API key and asking it to innovate would create more applications, but it would also create hidden data access, inconsistent controls, duplicated costs, and systems nobody knows how to operate.

The stronger model is federated. Domain teams build and own applications. Central teams build the conditions under which those applications can reach production.

Taylor describes Ramp’s structure as “build from the centre, drive from the spokes.” The central team provides the models, connectors, data access, knowledge plumbing, and shared harness. Functional teams build applications on top of it, and their production experience determines what the central platform builds next.

BBVA offers a banking version at a different scale. In Taylor’s account, its AI Factory validates base models, builds secure APIs, and establishes common compliance protocols. Country and business teams consume that infrastructure and adapt it to local workflows. Employees have reportedly created more than 3,000 custom GPTs, while a bank hackathon produced 315 working prototypes from non-engineers.

BBVA has not changed as completely as Allica. Taylor places it lower in his maturity model because the new team shapes and workflows are not yet universal. That contrast helps. A large bank can distribute access and experimentation before it has distributed full production ownership.

The central team therefore has a product of its own: the governed path from a local idea to a monitored application.

That path includes identity, approved model access, connectors, deployment templates, observability, evaluation services, audit evidence, cost attribution, and a registry of reusable components. A fraud team should not invent token accounting. A servicing team should not design its own secret-management system. A lending team should not negotiate the same data-retention rule from the beginning.

The centre removes repeated infrastructure work. It does not decide every local application.

Compliance Cannot Remain A Meeting

The control model has to change with the development model.

Traditional compliance and risk work often begins with a document. A specialist reads it, asks questions, adds comments, and schedules a discussion. The team revises the design and returns for another review.

That process can handle a small portfolio of large projects. It cannot follow thousands of applications and frequent releases without becoming the next queue.

Allica’s journey from a promising 85% demo to a production lending system shows what replaces the meeting. The team created high-quality test samples, used dual-model validation, ran ground-truth checks before release, deployed in shadow, monitored live results, checked for drift, and required security, model-risk, user, and second-line review.

The controls did not disappear. Many moved into the path the system had to pass.

Mohit Goyal reaches the same conclusion from the engineering side in his account of building an agentic harness.

“Safety has to be enforced outside the prompt.”

His system does not merely ask the model to behave. It removes write tools in planning mode, applies approval policies before actions run, marks external tool output as untrusted, records state, stops repeated loops, and tests those boundaries without making a model call.
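The idea of enforcing safety outside the prompt can be sketched in a few lines. This is not Goyal's code. The tool names, the two modes, and the `needs_approval` flag are hypothetical, chosen only to show that the boundary lives in ordinary software and can be tested without a model call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    writes: bool                 # True if the tool changes external state
    needs_approval: bool = False


# Hypothetical tool set for a lending workflow.
TOOLS = [
    Tool("read_application", writes=False),
    Tool("email_broker", writes=True, needs_approval=True),
    Tool("record_note", writes=True),
]


def available_tools(mode: str) -> list:
    """In planning mode the model is never offered write tools."""
    if mode == "planning":
        return [t for t in TOOLS if not t.writes]
    return list(TOOLS)


def authorize(tool_name: str, mode: str, approved: bool = False) -> bool:
    """Policy check that runs before any tool call, whatever the prompt says."""
    offered = {t.name: t for t in available_tools(mode)}
    tool = offered.get(tool_name)
    if tool is None:
        return False             # tool is not exposed in this mode
    if tool.needs_approval and not approved:
        return False             # a person has not signed off yet
    return True


print([t.name for t in available_tools("planning")])
print(authorize("email_broker", "execution"))
print(authorize("email_broker", "execution", approved=True))
```

Because `authorize` is a pure function of the tool, the mode, and the approval flag, a unit test can cover every boundary in milliseconds.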

A bank can use the same separation. The model may interpret an unstructured document. Deterministic software can validate required fields, enforce an entitlement, apply a known policy threshold, or block an unsupported action. An eval can test model behavior against historical cases. A trace can record the data, model, tools, and policy version behind a consequential decision.
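A small sketch of that separation, under stated assumptions: the model has already produced an `extracted` dict, and everything after that point is deterministic. The field names, the `MAX_AUTO_LOAN` threshold, and the policy version string are invented for illustration and do not come from any bank's policy.

```python
from dataclasses import dataclass, field

# Hypothetical policy values, for illustration only.
REQUIRED_FIELDS = ("applicant_name", "loan_amount", "annual_turnover")
MAX_AUTO_LOAN = 250_000
POLICY_VERSION = "example-1"


@dataclass
class Trace:
    """Evidence recorded alongside every outcome."""
    policy_version: str
    model_id: str
    inputs: dict
    checks: list = field(default_factory=list)
    result: str = "pending"


def apply_controls(extracted: dict, model_id: str) -> Trace:
    """Deterministic checks applied to model-extracted fields."""
    trace = Trace(POLICY_VERSION, model_id, dict(extracted))

    # Required-field validation: no model judgment involved.
    missing = [f for f in REQUIRED_FIELDS if extracted.get(f) in (None, "")]
    trace.checks.append(("required_fields", not missing))
    if missing:
        trace.result = "incomplete"
        return trace

    # Known policy threshold: above it, the case goes to a person.
    within_limit = extracted["loan_amount"] <= MAX_AUTO_LOAN
    trace.checks.append(("within_auto_limit", within_limit))
    trace.result = "eligible_for_decisioning" if within_limit else "referred"
    return trace


case = {"applicant_name": "A Ltd", "loan_amount": 400_000, "annual_turnover": 900_000}
print(apply_controls(case, "model-x"))
```

The trace carries the policy version and the model identifier, so a later reviewer can see which rule set and which model stood behind the result.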

The Return on Tokens thesis adds an economic reason for the same architecture. Model reasoning can help discover a rule in messy work. Once the rule becomes stable, repeated execution may belong in deterministic code rather than another expensive inference loop.

Not every rule will become software. Regulation contains ambiguity. New products create new risks. An unusual lending case may still need expert judgment.

Human control becomes more valuable when it is reserved for those cases. Specialists should work on new risk classes, incidents, uncertain policy, and changes to the control system. They should not repeatedly type the same interpretation into review comments.

Compliance becomes part of the platform: versioned policies, automated checks, evidence schemas, approval boundaries, monitoring, and a clear route back to a person.

The First Application Changes The Next One

The practical starting point is a single workflow.

Imagine a team beginning with the email lending path. The product manager sits with brokers, underwriters, and operations staff to reconstruct what actually happens. Which documents arrive? What is usually missing? Which policy rule resolves a case cleanly? Which exception requires judgment? Where does the broker wait without knowing why?

The engineer turns that path into states the system can observe: received, incomplete, extracted, validated, referred, decided, returned. They identify the systems of record, permissions, APIs, and deterministic controls. Together, product and engineering define the evidence that must accompany each decision and the point where autonomy stops.
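Those observable states can be written down as an explicit state machine. The state names come from the list above. The allowed transitions are an assumption made for this sketch, and a real team would derive them from its own process.

```python
from enum import Enum


class State(Enum):
    RECEIVED = "received"
    INCOMPLETE = "incomplete"
    EXTRACTED = "extracted"
    VALIDATED = "validated"
    REFERRED = "referred"
    DECIDED = "decided"
    RETURNED = "returned"


# Assumed transitions, for illustration only.
ALLOWED = {
    State.RECEIVED: {State.INCOMPLETE, State.EXTRACTED},
    State.INCOMPLETE: {State.RECEIVED},          # broker supplies missing documents
    State.EXTRACTED: {State.VALIDATED, State.REFERRED},
    State.VALIDATED: {State.DECIDED, State.REFERRED},
    State.REFERRED: {State.DECIDED},             # a person makes the decision
    State.DECIDED: {State.RETURNED},
    State.RETURNED: set(),
}


def advance(current: State, target: State) -> State:
    """Move a case to a new state, rejecting transitions the process forbids."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = State.RECEIVED
for step in (State.EXTRACTED, State.VALIDATED, State.DECIDED, State.RETURNED):
    state = advance(state, step)
print(state.value)
```

In this sketch a case cannot jump from received to decided. The point where autonomy stops is encoded in the transition table, not in an instruction to the model.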

The first release will not decide a loan. It can check completeness or prepare a recommendation. It can run in shadow against live work and compare its result with the existing process.

Allica’s experience gives this sequence weight. Its initial models looked strong and still failed the production threshold. The team learned which document types needed separate handling, where two models should agree, and which disagreements should trigger referral. Accuracy came from the loop around the model.

Production then creates new material. A rejected decision becomes a test case. A missing field improves the schema. An incident changes a policy check. A recurring broker question changes the response. The workflow leaves behind reusable skills, connectors, evals, traces, and runbooks.

Other teams should be able to find those assets.

Taylor reports that Ramp’s skills marketplace contained more than 350 shared skills and reached 700 daily active users within a month. The marketplace was built by a team of four in three months.

Taylor attributes one line to Ramp’s internal AI work:

“The failure mode wasn’t that people couldn’t figure things out. It was that everyone had to figure it out alone.”

Shared learning is the centre’s second product. A corrected control should travel through tooling rather than another presentation. A production incident should leave a regression test. A good extraction pattern should become a component. A better eval should reach every dependent application.

This also reduces hidden human work. In Glean’s vendor-sponsored Work AI Index 2026, employees reported spending about 6.4 hours a week feeding context, debugging, and cleaning AI output, while roughly 36% of sessions failed. The exact figures are self-reported, but the failure mode is concrete. When the system does not carry context and working patterns, people carry them manually.

Compliance specialists can end up as that manual integration layer in the same way. If every local team needs a person to translate the same policy into bespoke comments, scarce expertise disappears into repetition.

Autonomy Has To Be Adjustable

Decentralized ownership does not require every application to start with the same freedom.

Cristina Cordova describes this range in Linear’s product workflows. Some customers want to approve every routing suggestion. Others want the system to complete the work and return when a pull request is ready.

The task decides the boundary.

An internal assistant that relabels work items can act with wider autonomy than a system that changes a customer’s credit decision. A reversible formatting change differs from a payment. A workflow with clean ground truth differs from one whose experts still disagree about the right answer.

Autonomy should expand through production evidence. The team starts with recommendation or shadow mode, observes disagreements and failures, narrows the uncertain cases, and opens the action boundary when the evidence supports it.
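One way to make that progression concrete is a promotion rule driven by shadow-mode evidence. The four levels, the minimum case count, and the agreement threshold below are hypothetical. Each workflow would set its own according to risk.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    SHADOW = 0              # runs beside the existing process, no effect
    RECOMMEND = 1           # proposes, a person decides
    ACT_WITH_APPROVAL = 2   # acts after explicit sign-off
    ACT = 3                 # acts alone inside its boundary


def next_level(current: Autonomy, agreed: int, total: int,
               min_cases: int = 500, min_agreement: float = 0.98) -> Autonomy:
    """Promote by one level only when enough reviewed cases agree."""
    if total < min_cases:
        return current                      # not enough evidence yet
    if agreed / total < min_agreement:
        return current                      # evidence does not support promotion
    return Autonomy(min(current + 1, Autonomy.ACT))


print(next_level(Autonomy.SHADOW, agreed=495, total=500).name)
print(next_level(Autonomy.SHADOW, agreed=99, total=100).name)
```

The rule moves one step at a time and never past the top level. A fuller version would also demote a workflow when live agreement drops.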

Resource controls should follow the same path. Model routing, caching, batching, defaults, attribution, and limits belong in the shared platform. NVIDIA’s tokenomics discussion frames the measure correctly: token cost and throughput have to connect to business value.

A domain team should see what its workflow costs and which results survive review. It should not be rewarded for consuming more tokens or launching more agents. The lending system earns its cost when it returns good decisions faster, reduces manual work without hiding review elsewhere, and improves from the cases it could not handle.
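A rough sketch of what that view could compute: cost per result that survives review, rather than raw token volume. The per-token prices and the `Run` record are assumptions for the example, not real price points or a real schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical prices in currency units per million tokens.
PRICE_IN = 3.0
PRICE_OUT = 15.0


@dataclass
class Run:
    workflow: str
    input_tokens: int
    output_tokens: int
    survived_review: bool   # did the result hold up when a person checked it?


def cost(run: Run) -> float:
    """Token cost of a single run."""
    return (run.input_tokens * PRICE_IN + run.output_tokens * PRICE_OUT) / 1_000_000


def cost_per_accepted(runs: list, workflow: str) -> Optional[float]:
    """Total spend on a workflow divided by the results that survived review."""
    mine = [r for r in runs if r.workflow == workflow]
    accepted = sum(r.survived_review for r in mine)
    if accepted == 0:
        return None         # nothing survived, so the ratio is undefined
    return sum(cost(r) for r in mine) / accepted


runs = [
    Run("lending", 1_000_000, 0, True),
    Run("lending", 1_000_000, 0, False),
    Run("fraud", 500_000, 0, True),
]
print(cost_per_accepted(runs, "lending"))
```

Failed runs still count toward spend, so a workflow that burns tokens on rejected output looks more expensive, not more productive.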

The Trade Is Speed For A Different Kind Of Control

Federated systems are untidy. Two teams may build similar extractors. A local application may remain in use after a better shared component appears. Costs and permissions can spread faster than central teams expect.

The platform needs an inventory of applications, owners, data access, spend, production use, and outcomes. It needs retirement paths as well as deployment paths. Central teams can consolidate patterns after local experiments prove which one works.

Central development looks tidier. It can still waste more. Requirements are averaged across teams. Priorities compete in one portfolio. The application arrives later and carries a larger sunk cost before users have tested it.

Ukraine’s defence-technology system provides a brief high-stakes comparison from a radically different setting. CSIS describes how more direct procurement authority helped units respond to varied frontline needs. The official Brave1 marketplace surrounds that local choice with catalogues, supplier access, reviews, codification, and feedback.

Banking is not warfare. The comparison is about topology. When local needs change faster than a central planner can understand them, the centre can provide standards, visibility, approved markets, and learning infrastructure while people close to the problem choose and adapt the solution.

The traditional control function loses some authority over which applications are built. It gains direct influence over every application that uses its policy service, evidence schema, permission system, or monitor.

That is a different kind of control. It travels through the system.

The Operating Model Is What Scales

The broker’s email did not become a 12-minute decision because Allica found one good prompt.

The bank spent years building proprietary software and clean data. It changed its design system, engineering roles, team sizes, tests, skills, and release controls. It learned through a failed accuracy phase. Product, engineering, operations, security, model risk, users, and the second line all shaped the production path.

The lending team understood the decision. Central capabilities made the decision safe to automate. Production taught both groups what to change next.

That pattern can scale across a bank. The central unit supplies connectors, identity, policies, evidence, observability, cost controls, and reusable learning. Domain teams turn those capabilities into lending, fraud, onboarding, servicing, finance, and operations applications.

Customers will never care how many agents the bank launched. They will notice that the loan answer arrived sooner, the fraud case was resolved correctly, or the complaint stopped moving between queues.

An AI-native bank gets built from both directions: capability from the centre, applications from the edges, and evidence flowing between them.

References

The AI Operating Model Playbook 2026, Simon Taylor / Fintech Brainfood, June 11, 2026.
The Intelligence Company Gets Built, Simon Taylor / Fintech Brainfood, May 17, 2026.
Scaling AI at Allica, Allica Bank, April 30, 2026.
Why AI Isn’t Showing Up on Your Bottom Line, Azeem Azhar, June 4, 2026.
Babysitting the Machine, Rebecca Hinds / The Cognitive Revolution, June 10, 2026.
Work AI Index Report 2026, Glean, June 2026.
I Built an Agentic Harness From Scratch, Mohit Goyal, June 7, 2026.
Rebuilding the Product Dev Lifecycle for Teams and Agents, Cristina Cordova / Product School, June 10, 2026.
How I Cut Our AI Model Spend in Half, OnlyCFO, June 9, 2026.
Return on Tokens, Packy McCormick and Markie Wagner, June 10, 2026.
Inside AI Tokenomics, NVIDIA AI Podcast, May 20, 2026.
How Ukraine Rebuilt Its Military Acquisition System Around Commercial Technology, CSIS.
Brave1 Market, Brave1.

AI Operating Systems by Mart Roosimägi
