Why 95% of GenAI Pilots Fail — and What the 5% Do Differently
MIT puts the failure rate at 95%. RAND puts it above 80%. Gartner forecasts that 40% of agentic AI projects will be canceled by 2027. The numbers are real, the failure modes are consistent, and the difference between the 95% and the 5% is not the model. It's the execution discipline around the model.
What the numbers actually say
The "AI pilots fail" headline has been around long enough that it sounds like a cliché. The underlying research isn't one. Three institutions, working independently and with different methods, have reported consistent findings in the last eighteen months.
MIT NANDA, August 2025. The State of AI in Business 2025 report — "the GenAI Divide" — was based on 150 leader interviews, 350 employee surveys, and analysis of 300 publicly documented AI deployments. The headline finding: 95% of corporate generative-AI pilots produced no measurable P&L impact. Roughly 5% achieved rapid value; the rest stalled. The report's most consequential secondary finding was that initiatives purchased from specialized vendors succeeded about 67% of the time, while initiatives built internally succeeded about 33% of the time.
RAND Corporation, 2024. The Root Causes of Failure for Artificial Intelligence Projects (RR-A2680-1) was based on 65 in-depth interviews with senior data scientists and engineers, each with five or more years of AI experience. RAND concluded that more than 80% of AI projects fail — roughly twice the failure rate of regular IT projects. They named five root causes, and we'll come back to each one below.
Gartner, 2024–2025. Two separate press releases. In July 2024, Gartner predicted that 30% of generative AI projects would be abandoned after proof-of-concept by end of 2025. In June 2025, Gartner forecast that more than 40% of agentic AI projects would be canceled by end of 2027. (Worth noting: the often-cited "85% of AI projects fail" line traces back to a 2018 Gartner press release predicting that 85% would deliver erroneous outcomes due to data bias, which is not the same as outright failure. The popular press has distorted the original wording; we should not.)
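So the honest headline is this: depending on the study and the definition of failure, roughly 80 to 95 percent of AI and GenAI pilots fail to produce measurable business return (RAND and MIT NANDA), and Gartner expects 30 to 40 percent to be abandoned or canceled outright. The numbers are directionally consistent. They are not noise.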
The five root causes — RAND's framing
These map almost directly to what we've seen in real engagements. The order matters: the most common failures come first. RAND named five root causes; the sixth entry below comes from a separate HBS/BCG study.
The wrong person owns the project
The AI initiative is owned by the CIO, the innovation team, or a Chief AI Officer who reports to nobody who actually does the work. The line managers whose process is being changed weren't consulted, don't want it, and are quietly waiting for the project to fail. The model can be perfect; the rollout dies in the org chart.
Garbage in, AI out
The underlying data — transaction histories, document libraries, ticket archives — is partial, inconsistent, or labeled wrong. The model is trained or tuned on this data and inherits all of its flaws. RAND notes that the work of cleaning up the data is consistently underestimated by 5–10x in scope and timeline. Most pilots die here.
The plumbing isn't there
The model works in a notebook. Getting it into production — connected to the right systems, with monitoring, with rollback, with audit logs, with permissioning — turns out to be the actual project, and nobody scoped it. The pilot is left as a Jupyter demo nobody can use without the data scientist who built it.
The KPI was set by marketing
The board was told the AI would cut costs by 50%, replace the support team, or 10x revenue. None of those are honest claims for a 90-day pilot. When the actual gain is 14% productivity (the Brynjolfsson NBER number, which is the strongest in-production result for customer ops in the literature), the pilot is judged a failure against a number it was never going to hit.
The clock runs out before the curve bends
RAND notes that AI projects often have a longer lead-in than IT projects — the model needs to be tuned, the feedback loop needs to run a few cycles, the team needs to learn what to override. Pilots judged at 90 days are usually killed just before the curve bends. The 5% that survive are usually given 6–12 months with checkpoint metrics, not a 90-day kill switch.
The "jagged frontier" problem
Dell'Acqua et al. (HBS/BCG, 2023) called it the "jagged technological frontier" — AI excels at some tasks and is worse than nothing on others. The vendor demo showed the tasks it excels at. Production is full of the other ones. The consultants in the BCG study who used AI on tasks outside the model's capability frontier were 19 percentage points less likely to produce correct answers. Failure here looks like accuracy regression, not refusal to function.
What the 5% who succeed do differently
From the MIT NANDA data, from the RAND interviews, and from our own engagements, the pattern of the successful 5% is remarkably consistent. Six disciplines. None of them is about the model.
Scope brutally narrow
One workflow. One named team. One measurable outcome. The successful pilots are not "transform our finance function." They are "reduce our invoice-processing time from 12 minutes to under 3" or "cut tenant first-response time from 4 hours to under 30 minutes." You cannot measure transformation. You can measure invoice processing time.
The line owner runs the project
The person who has actually done the work, who knows where the exceptions live, who has the credibility with the team being affected. Not the CIO. Not the head of innovation. The line manager. MIT NANDA's data on this is unambiguous: pilots owned at the line-manager level succeed at multiples of the rate of pilots owned by central innovation teams.
Specialized vendor, not generalist build
The 67% vs. 33% success rate gap MIT NANDA reported is one of the most striking findings in the study. Build-from-scratch AI inside a company that doesn't make AI products is generally a bad bet. Buy from a vendor who lives in your specific problem space. (Or, if you are the vendor, this is exactly why operator-specialist AI shops outperform generalist consultancies.)
Honest baseline before turning anything on
Measure the current process, and really measure it, with a stopwatch, before the AI is deployed. Otherwise the gain is unprovable: the project gets cut in the next quarter's budget review because the improvement was real but nobody can defend it. The first deliverable in every successful pilot we've seen is a baseline measurement document.
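To make the baseline idea concrete, here is a minimal Python sketch of how stopwatch timings could be turned into a baseline summary. The `Baseline` class, the `summarize_baseline` function, and the sample numbers are all hypothetical, chosen for illustration; a real baseline document would also record who measured, when, and how cases were sampled.

```python
from dataclasses import dataclass
from statistics import mean, median, quantiles


@dataclass(frozen=True)
class Baseline:
    """Summary of the current (pre-AI) process, built from timed observations."""
    sample_size: int
    mean_minutes: float
    median_minutes: float
    p90_minutes: float
    error_rate: float


def summarize_baseline(durations_minutes: list[float], had_error: list[bool]) -> Baseline:
    """Build a baseline from stopwatch timings and one error flag per case."""
    if len(durations_minutes) != len(had_error):
        raise ValueError("each timed case needs exactly one error flag")
    if len(durations_minutes) < 2:
        raise ValueError("need at least two observations")
    # quantiles(n=10) returns the nine decile cut points; the last is the 90th percentile.
    p90 = quantiles(durations_minutes, n=10)[-1]
    return Baseline(
        sample_size=len(durations_minutes),
        mean_minutes=mean(durations_minutes),
        median_minutes=median(durations_minutes),
        p90_minutes=p90,
        error_rate=sum(had_error) / len(had_error),
    )


# Illustrative data only: ten timed invoices, two of which contained an error.
timings = [10, 12, 14, 11, 13, 12, 15, 9, 12, 12]
errors = [False, False, True, False, False, False, True, False, False, False]
baseline = summarize_baseline(timings, errors)
print(baseline)
```

Reporting the median and 90th percentile alongside the mean is a deliberate choice in this sketch: a handful of slow exception cases can dominate the mean, and those are often the cases the AI will not handle.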
Human in the loop where the cost of error is high
Not as a bottleneck, but as a designed control point. The successful pilots don't try to fully automate the workflow on day one. They automate the high-confidence majority and route the rest to a human, and they measure the override rate as a system-health metric over time. As accuracy improves, more of the workflow runs autonomously. The systems that fail try to skip the human step before the data justifies it.
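The control point described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed design: `Prediction`, `route`, `override_rate`, and the 0.9 threshold are hypothetical, and it assumes the model exposes a usable confidence score, which not every system does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    case_id: str
    label: str
    confidence: float  # assumed: a model-reported score in [0, 1]


def route(pred: Prediction, threshold: float = 0.9) -> str:
    """Return "auto" for high-confidence cases and "human" for everything else."""
    return "auto" if pred.confidence >= threshold else "human"


def override_rate(reviewed: list[tuple[str, str]]) -> float:
    """Share of human-reviewed cases where the reviewer changed the model's label.

    Each item is a (model_label, final_label) pair.
    """
    if not reviewed:
        return 0.0
    changed = sum(1 for model_label, final_label in reviewed if model_label != final_label)
    return changed / len(reviewed)


# Illustrative cases only.
queue = [
    Prediction("inv-001", "approve", 0.97),
    Prediction("inv-002", "approve", 0.62),
]
for pred in queue:
    print(pred.case_id, route(pred))
```

In a setup like this, the threshold is the dial you tighten or loosen over time. A falling override rate on the human-reviewed cases is the kind of evidence that would justify lowering it.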
Evaluation is a first-class deliverable
Every successful AI deployment we've seen has a test set, a measured accuracy, an evaluation harness that runs on every model update, and an alert when accuracy drifts. "It worked in the demo" is not acceptance. Measurable accuracy on real data is. This is the single most-skipped step in failed pilots — and the cheapest one to add at the start.
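A bare-bones version of such a harness might look like the following Python sketch. The function names, the toy keyword "model", and the 0.02 tolerance are assumptions made for illustration; a real harness would call the deployed model and use a labeled sample of real cases.

```python
from typing import Callable


def accuracy(model: Callable[[str], str], eval_set: list[tuple[str, str]]) -> float:
    """Fraction of (input, expected_label) pairs the model labels correctly."""
    if not eval_set:
        raise ValueError("eval set must not be empty")
    correct = sum(1 for text, expected in eval_set if model(text) == expected)
    return correct / len(eval_set)


def has_drifted(current: float, reference: float, tolerance: float = 0.02) -> bool:
    """True when accuracy has dropped more than `tolerance` below the reference."""
    return (reference - current) > tolerance


# Toy stand-in for a real model: labels anything mentioning "invoice" as an invoice.
def toy_model(text: str) -> str:
    return "invoice" if "invoice" in text.lower() else "other"


# Illustrative labeled cases only.
eval_set = [
    ("Invoice #1042 attached", "invoice"),
    ("Please see attached bill", "invoice"),
    ("Lunch on Friday?", "other"),
    ("Overdue invoice reminder", "invoice"),
]

score = accuracy(toy_model, eval_set)
if has_drifted(score, reference=0.90):
    print(f"ALERT: accuracy {score:.2f} is below the accepted reference")
```

Running the same function on every model update, against the same labeled set, is what turns "it worked in the demo" into a number that can be tracked.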
The reframe: why this is a competitive moat, not a problem
The 95% failure rate sounds like an indictment of AI. It isn't. It's an indictment of how most organizations approach AI projects — the same way they approached big IT projects in the 2000s. A central transformation team, a vendor with a demo, a 90-day pilot, a status report. That model produces the same failure rate it always did for big IT initiatives; AI just made the failures more visible because the demos look so compelling.
The reframe: the 5% who succeed are doing something different, and the difference is execution discipline plus operator domain knowledge. Neither of those scales by hiring more data scientists. Both of them scale by working with people who have actually done the work you are trying to automate. This is exactly why we describe AMG as an operator company that uses AI as a tool, not an AI company looking for a use case. The technology is commodity. The execution is the moat.
For a small or mid-sized business owner reading this, the implication is direct. You will not see the 95% failure rate if you do five things: pick one workflow, put yourself or your line lead in charge of it, work with a vendor or partner who has operated in your domain, measure the baseline before you turn it on, and design a human review path that you tighten over time. The technology will work. The execution discipline is the differentiator.
What this looks like in our engagements
We don't run "AI pilots." We run scoped operational improvements that happen to use AI where it changes the answer. Every engagement starts with the same three deliverables:
- Baseline document. Current process, current time, current error rate. Measured, not estimated.
- Outcome spec. What good looks like. Specific numbers, agreed up front, not aspirational ones.
- Eval set. A labeled sample of real cases the AI will run against, so accuracy is measurable on day one and tracked over time.
That's the discipline. Everything else — model selection, integration design, human-in-the-loop placement, deployment, measurement — flows from those three deliverables. It's not exciting. It is, in the MIT NANDA framing, the actual differentiator between the 5% that succeed and the 95% that don't.
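As an illustration of what "specific numbers, agreed up front" can mean in practice, an outcome spec can be written down as a small data structure and checked at each checkpoint. The sketch below is hypothetical: `OutcomeSpec` and its fields are our own naming, and the 12-to-3-minute figures reuse the invoice example from earlier in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeSpec:
    metric: str
    baseline: float  # measured before the AI is turned on
    target: float    # agreed up front
    lower_is_better: bool = True

    def is_met(self, measured: float) -> bool:
        """True once the measured value is strictly past the target."""
        if self.lower_is_better:
            return measured < self.target
        return measured > self.target

    def progress(self, measured: float) -> float:
        """Fraction of the baseline-to-target gap closed so far.

        Can be negative (regression) or exceed 1.0 (target overshot).
        """
        gap = self.baseline - self.target
        if gap == 0:
            raise ValueError("baseline and target must differ")
        return (self.baseline - measured) / gap


spec = OutcomeSpec(metric="invoice_processing_minutes", baseline=12.0, target=3.0)
print(spec.is_met(6.0), spec.progress(6.0))
```

A progress figure like this gives a 6 to 12 month pilot something to report at each checkpoint other than a pass or fail verdict.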
What this means for AMG clients
If you are evaluating AI for your business — whether through a vendor, an internal effort, or a consulting engagement — the most important question is not "which model" or "which platform." It's "who owns this, what's the measurable outcome, and where does the human review live?" If those questions don't have crisp answers, you are looking at a 95% project. If they do, you are looking at a 5% project.
We'd rather have a 20-minute conversation about the workflow you're thinking about before you spend money than help you fix a failed pilot after the fact. That's how the 5% gets built — not by being smarter about models, but by being honest about execution before the model is even chosen.
Thinking about an AI initiative?
Send a short note describing the workflow. We'll tell you honestly which discipline gaps would put it in the 95% — and what would put it in the 5%.