Al Karakas
AI programmes · Delivery

AI Programmes Don't Fail at the Model

80% of AI projects fail to deliver business value. Enterprises spent $252 billion on AI in 2024, and 74% showed no measurable return. In 2025, 42% of companies abandoned most of their AI initiatives. None of those programmes failed because the technology stopped working.

80% of AI projects fail to deliver business value, roughly twice the rate of comparable non-AI technology programmes (RAND Corporation)
$252B spent on AI globally in 2024. 74% of organisations report no measurable return from the investment
42% of enterprises abandoned most of their AI initiatives in 2025, up from 17% the year before (S&P Global Market Intelligence)

IBM Watson for Oncology spent four years and $62 million at MD Anderson Cancer Center. Not a single patient was treated before the programme was cancelled. The model had processed the medical literature. The natural language processing worked. The failure was organisational: the way physicians actually worked was incompatible with the way the system assumed they would work, and nobody had designed the adoption path between the two.

Zillow's iBuying algorithmic pricing system led to a $304 million operating loss in a single quarter, the shutdown of an entire business line, and 2,000 redundancies. The algorithm had processed millions of property transactions. It worked in the conditions it was trained on. It failed because governance of the system (the monitoring, feedback loops, and escalation paths needed once the model began drifting) did not keep pace with deployment.

These are not edge cases. They are high-profile versions of the same pattern that appears in programmes at a fraction of that scale: a technology that functions, surrounded by a delivery structure that does not.

The uncomfortable truth

Every failing AI programme I have been brought in to recover had a working model at its centre.

That is not a rhetorical flourish. It is the single most consistent finding across the engagements I have stabilised. The model produced plausible output. The retrieval returned relevant documents. The classifier hit its accuracy target on the test set. By every measure the team was actually watching, the technology worked. And the programme was still days from a termination decision.

This is the failure pattern most AI post-mortems miss. They reach for the model because the model is the new and unfamiliar thing, and unfamiliar things feel like the obvious place for risk to live. The analysis fixates on hallucination rates, prompt engineering, model choice, fine-tuning. Meanwhile the actual failure sits exactly where it sits in every other technology programme that has ever collapsed: in the delivery structure wrapped around the thing that works.

There is a specific reason a working model makes things worse, not better. When the technology demonstrably functions, everyone assumes the problem must be elsewhere. The team is confident. Management is reassured. The client waits. Nobody examines the delivery structure, because nobody thinks that is where the problem is. By the time someone does look, the budget is half-spent and the client is assembling the documentation for a termination conversation. A working model is not a green flag on programme health. Without the delivery structures around it, it is cover for an unexamined failure in plain sight.

The failure taxonomy

When I diagnose a stalled AI programme, the same five structural gaps recur. None of them is about the model.

No data quality ownership. Ask "who is accountable for the quality of what goes into this system" and you get silence, or you get three people who each assume it is one of the other two. AI programmes are uniquely exposed here because the input is not a clean schema someone designed. It is messy operational data nobody has owned end to end. The model faithfully learns from, or retrieves over, whatever it is given. Garbage in is not a cliché in this context. It is the mechanism of failure, and it is invisible until someone evaluates output against ground truth, which nobody has been resourced to do.

No adoption design. Somebody decided to build the capability. Almost nobody decided, deliberately and early, how a real user would fold it into a real workflow. Adoption gets treated as a downstream consequence of a good model rather than a designed outcome with its own owner, its own milestones, and its own tracking. The system goes live into a workflow that was never adjusted to receive it, usage flatlines, and the programme is judged a failure of technology when it was a failure of operational design. The Watson cancellation was this pattern at scale: the model worked, the clinical workflow did not change to receive it, and so the capability was never actually used.

No governance feedback loop. Who monitors what the model produces after go-live? In a traditional system, behaviour is deterministic. You test it once and it stays tested. An AI system drifts. The data distribution shifts, the use case evolves, the upstream source changes format, and output degrades silently because nobody owns the evaluation cadence. The Zillow system drifted into a housing market that no longer matched its training data. Nobody caught it until the write-down.

No backlog. AI teams often come out of a research or data-science culture where work is exploratory by nature. That mode is genuinely valuable for reducing technical uncertainty. But it has no native connection to delivery. There is no prioritised, estimable list of increments that ladder up to a committed outcome. The team is busy, the team is producing, and there is no way to answer the only question the client actually cares about: are we on track to deliver the thing we are paying for?

The use case shifts and nobody catches it. The client's understanding of what they need evolves as they see early output. That is healthy. What is fatal is the absence of a feedback loop between delivery and that evolving understanding. The team keeps building against the original brief while the target quietly moves, and the gap is only discovered when the budget is nearly gone and the deliverable lands wide of what is now wanted.
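The silent degradation described under the governance gap can be made visible with a very small check. The sketch below computes a Population Stability Index (PSI) between a baseline sample of one numeric input and a live sample of the same input. It is an illustration, not something the engagements above used: the function name, the ten-bin default, and the floor applied to empty bins are all assumptions chosen for the example.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
) -> float:
    """PSI between a baseline sample and a live sample of one numeric feature.

    Bin edges are taken from the baseline range (an assumption of this sketch).
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    base_shares, live_shares = shares(baseline), shares(live)
    return sum(
        (lv - bv) * math.log(lv / bv)
        for bv, lv in zip(base_shares, live_shares)
    )
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the threshold is a convention, not a standard. The point is less the statistic than the ownership: someone named runs it on a schedule and reads the result.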

I want to ground this in specifics, because taxonomies are cheap.

On an Azure OpenAI platform recovery I inherited one month in, roughly £20,000 had been spent and I could trace approximately £1,200 of delivered value. The model was working. There was no backlog, no governance, no reporting, no ceremonies, no DevOps pipeline. The team was not idle and the technology was not broken. There was simply no structure converting activity into outcomes.

On a £350,000 AI and data platform engagement, the client had assembled a twenty-page critique of the programme by month two. The contract was sized for forty days of tangible, demonstrable output. What had actually been delivered was a solution design document and a handful of workshops. No development team had been stood up. No code had been written. The programme manager was managing other accounts and was not present in the delivery. The client had not been brought into any meaningful review cadence. The engagement had consumed most of its early budget producing artefacts that described what would be built, while the building itself had not started. No backlog. No definition of done. No sprint structure. No team on the ground.

On a £120,000 data platform programme, the team was diligently building the wrong thing. Not because anyone was incompetent, but because no feedback loop connected delivery to the client's sharpening sense of their own requirement. The real need was identifiable ahead of its formal articulation. It just needed someone to pause the burn, run a scoped workshop, and refocus delivery inside the existing funding before the money was gone.

Three different technologies. Three different clients. The same structural void every time.

The numbers say this is systemic

If this were three unlucky engagements, it would be anecdote. It is not.

Share of enterprises that abandoned the majority of their AI initiatives: 17% in 2024, 42% in 2025, plus 46% of PoCs scrapped before production. Source: S&P Global Market Intelligence, 2025 survey of 1,000+ enterprises.

S&P Global Market Intelligence's 2025 survey of over a thousand enterprises found that 42% of companies abandoned the majority of their AI initiatives that year, up from 17% in 2024. The abandonment rate did not creep. It nearly tripled in twelve months. The same study found organisations scrapped 46% of AI proof-of-concepts before they ever reached production.

McKinsey's State of AI work puts the governance gap in sharp relief: 72% of organisations report AI in use, while fewer than one in five has an enterprise body with real authority over responsible AI decisions. Deloitte's 2026 work on autonomous agents finds only about one in five companies has mature governance of agentic AI, even as agents move into production.

Read those together. The adoption curve and the governance curve have separated: the overwhelming majority of organisations now run AI in production with governance that does not meet their own definition of mature. That is not a technology problem. A 42% abandonment rate is what it looks like when a discipline scales its deployment far faster than it scales the structures that make deployment survivable.

What the first 72 hours of a recovery actually look like

When I am brought into a programme recovery, people expect something exotic for an AI engagement. It is not exotic. It is the same diagnostic discipline I would apply to any failing programme, plus three AI-specific questions bolted on.

The standard pass first. Is there a backlog, and does it ladder to a committed outcome? Is there a definition of done? Is there a reporting cadence that tells both sides the truth? Are there ceremonies that surface problems early rather than at month-end? Is there a RAID log that anyone looks at? In a failing AI programme the answer to most of these is no, and that alone explains most of the wreckage.

Then the three questions specific to AI:

Who owns data quality, by name? Not which team. Which person can be held to account for the quality and lineage of what feeds the system. If the answer is a team, it is nobody. This is the Watson question: not did the model learn correctly, but what did you give it to learn from, and who is accountable for that input.

Where is the model evaluation cadence? How is output sampled, against what ground truth, how often, and who reads the result? If evaluation happened once before go-live and never since, the system is drifting blind. This is the Zillow question: not did the algorithm work at launch, but who was watching it after launch, and what happened when the market moved.

What is the definition of done for an AI deliverable? "The model produces output" is not done. Done specifies the quality bar, who validates against it, the adoption hook into a real workflow, and the monitoring that keeps it honest after launch. If none of those elements are in the definition, you do not have a definition.
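The four elements of that definition of done can be held as a simple record, so a missing element is a visible gap and not a matter of opinion. This is a minimal sketch assuming Python 3.9 or later; the class and field names are hypothetical and simply mirror the four elements named above.

```python
from dataclasses import dataclass, fields


@dataclass
class DefinitionOfDone:
    """The four elements an AI deliverable's definition of done should state."""

    quality_bar: str = ""             # e.g. the agreed threshold on a reviewed sample
    validator: str = ""               # the named person who validates against the bar
    adoption_hook: str = ""           # the real workflow step the deliverable plugs into
    post_launch_monitoring: str = ""  # what is sampled after go-live, and how often

    def missing(self) -> list[str]:
        """Names of the elements that have been left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_defined(self) -> bool:
        """True only when every element has been filled in."""
        return not self.missing()
```

A deliverable whose record reports anything from `missing()` has not yet earned the word "done", however well the model runs.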

In the £20k-to-£1.2k recovery, installing exactly this skeleton (a backlog, a definition of done, a reporting cadence, named ownership) was the entire intervention. The model never needed touching. Stabilising delivery and installing minimum viable governance was sufficient to turn the engagement around and, across that body of work, to convert delivery confidence into £2m of extensions. None of that came from improving a model. All of it came from making delivery legible.

Why it keeps happening

If the fix is this knowable, why does the industry keep walking into the same hole?

Flyvbjerg's answer is the planning fallacy, and it is amplified in AI by a specific cognitive trap. Optimism bias is bad enough on its own. AI adds strategic misrepresentation, where vendors and internal champions both have incentives to understate complexity and compress timelines, and then layers on something more corrosive: the rejection of the reference class.

The reference class is the set of comparable past projects whose actual outcomes should anchor your forecast. Flyvbjerg's central prescription is to forecast from that outside view rather than from your own inside-view estimate, because the inside view is reliably, predictably optimistic. AI programmes reject the reference class wholesale, because the technology feels unprecedented. "This is different. The old rules about backlogs and governance and change management were for the old kind of software." So the hard-won lessons of every prior technology programme are discarded precisely when they are most needed.
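The outside view is mechanical enough to sketch. Assuming you have the ratio of actual to estimated cost (or duration) for a set of comparable past programmes, you scale your own inside-view estimate by the ratio at a chosen percentile of that distribution. The function name, the 80th-percentile default, and the choice of interpolation method are assumptions of this example, not a prescription from Flyvbjerg's work.

```python
import statistics


def reference_class_forecast(
    inside_view_estimate: float,
    past_overrun_ratios: list[float],
    confidence: float = 0.8,
) -> float:
    """Scale an inside-view estimate by the reference class's overrun ratio.

    past_overrun_ratios holds actual / estimated for comparable past projects
    (at least two values). confidence selects the percentile used as uplift.
    """
    # n=100 yields 99 cut points; cut point k sits at index k - 1.
    cuts = statistics.quantiles(past_overrun_ratios, n=100, method="inclusive")
    percentile = min(max(round(confidence * 100), 1), 99)
    uplift = cuts[percentile - 1]
    return inside_view_estimate * uplift
```

With illustrative ratios of 1.0, 1.2, 1.5, 2.0 and 3.0, an inside-view estimate of 100 days becomes roughly 220 days at the 80th percentile. The numbers are invented; the discipline is anchoring on what comparable programmes actually did.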

Amy Edmondson's research on psychological safety explains the silence around the warning signs. AI programmes fail quietly. The team often senses early that the model is underperforming, the data is worse than the pitch assumed, or the use case has moved. In a low-safety environment, where the AI initiative is the executive sponsor's flagship and doubt reads as disloyalty, those signals do not surface. What gets recorded later as an AI failure was, in real time, a psychological safety failure. The information existed. The structure to surface it did not.

The discipline Ken Schwaber and Jeff Sutherland built Scrum on, empirical process control resting on transparency, inspection, and adaptation, is not optional for AI. It is more necessary than for conventional software, because AI output is probabilistic and drifts. The NIST AI Risk Management Framework (2023) and ISO/IEC 42001 codify the same instinct: data quality and lineage, model performance monitoring, human oversight design, and bias review are distinct governance domains that traditional software governance does not cover.

What mature AI programme governance actually looks like

Not "implement an AI governance framework." That sentence has launched a thousand stalled initiatives. Here is what it means concretely, expressed as artefacts and decisions a programme either has or does not.

A named data owner. One person, accountable for the quality and lineage of the data feeding the system, with the authority to fix it. Documented lineage that lets you trace any output back to its sources, so that when quality slips you can find where.

A model evaluation cadence on the sprint clock. A defined sample of live output, scored against a defined quality bar, every sprint, by a named reviewer, with the result on the same board as every other delivery metric. Drift becomes a visible trend, not a surprise.

Adoption tracked from day one. Usage and workflow integration carried as first-class delivery metrics from the start, not measured after go-live to find out whether anyone showed up.

A RAID log that carries AI-specific risks. Data drift, evaluation gaps, the use case shifting, silent degradation, oversight gaps on anything agentic. These belong in the log alongside the ordinary delivery risks, reviewed on the ordinary cadence.

A definition of done that includes the quality bar, the validation step, the adoption hook, and the post-launch monitoring. "Model produces output" never appears in it.
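The evaluation cadence in particular is easy to make concrete. The sketch below samples live output once per sprint, scores it against a quality bar, records the named reviewer, and flags a sustained downward trend. Everything in it is an assumption for illustration: the names, the sample size of 50, the 0.9 bar, and the three-sprint window. The `is_acceptable` callable stands in for whatever ground-truth review the programme actually uses.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass
class SprintEvaluation:
    sprint: int
    reviewer: str     # a named person, not a team
    pass_rate: float  # share of sampled outputs judged acceptable
    meets_bar: bool


def evaluate_sprint(
    sprint: int,
    reviewer: str,
    live_outputs: Sequence[Any],
    is_acceptable: Callable[[Any], bool],
    sample_size: int = 50,
    quality_bar: float = 0.9,
    seed: Optional[int] = None,
) -> SprintEvaluation:
    """Score a random sample of live output against the quality bar."""
    if not live_outputs:
        raise ValueError("no live output to sample this sprint")
    rng = random.Random(seed)
    sample = rng.sample(list(live_outputs), min(sample_size, len(live_outputs)))
    pass_rate = sum(1 for item in sample if is_acceptable(item)) / len(sample)
    return SprintEvaluation(sprint, reviewer, pass_rate, pass_rate >= quality_bar)


def is_degrading(history: Sequence[SprintEvaluation], window: int = 3) -> bool:
    """True if the pass rate fell in each of the last `window` sprints."""
    rates = [e.pass_rate for e in history[-(window + 1):]]
    return len(rates) == window + 1 and all(
        later < earlier for earlier, later in zip(rates, rates[1:])
    )
```

Putting `pass_rate` on the same board as velocity and burn is the design choice that matters: a falling trend gets discussed at the next review instead of being discovered at the write-down.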

Three questions to ask in the first week of any AI programme

Question one

Who, by name, owns the quality of the data going into this system? If the answer is a team, a function, or a shrug, you have found your most likely failure point before you have spent the budget. IBM Watson for Oncology failed this question at $62 million. The question costs nothing to ask in week one.

Question two

Show me the model evaluation cadence. Not the pre-launch test results. The repeating, scheduled process by which live output is sampled and scored after go-live, and the name of the person who reads it. If it does not exist, the system will drift and you will not know until a user complains.

Question three

What is the definition of done for the next deliverable, and does it include adoption and monitoring? If done means the model runs, you are measuring activity, not outcomes, and the gap between the two is where programmes die.

The opportunity hiding inside the failure rate

Here is the part that should be encouraging rather than grim. If AI programmes were failing because the technology did not work, that would be a hard problem with an uncertain horizon. They are not. The technology works. The models at the centre of the wreckage I have recovered were, almost without exception, fine.

The failure point is the delivery structure, and delivery structure is a solved problem. We have known for decades how to run a programme: a backlog, a definition of done, named ownership, a reporting cadence, a feedback loop, and the honesty to surface bad news early. The AI-specific additions (data quality ownership, an evaluation cadence, adoption design, AI risks in the RAID log) are a modest extension of that known discipline, not a new science.

The 42% is not a verdict on the technology. It is a measure of how far deployment has outrun delivery maturity, and that gap is closable with practices that already exist. The model is not the hard part. Running the programme is, and running programmes is something we already know how to do.

The organisations that internalise this, and treat an AI programme as a programme first and an AI initiative second, will quietly move from the 42% that abandon to the minority that ship. The technology is ready. The programmes just need to be run.