Why AI gets the wrong answer even when it has the data
TL;DR: Most enterprise AI tools use a technique called RAG to connect AI to your business data. It works when that data is clean, bounded and static. Most business data is none of those things. The architecture that actually works at scale looks different. Here's why.

Something went wrong with Woolworths' AI chatbot, Olive, earlier this year.
A customer called the Australian supermarket giant, got through to Olive and asked a simple question. Olive gave them an answer. Then rambled about her ‘mother’.
Further testing revealed that Olive was making pricing errors across multiple basic items.
For a company already facing proceedings over allegedly misleading discount practices, an AI confidently quoting the wrong prices isn’t just a quirky glitch. It’s a serious problem.
So what went wrong?
A likely answer is that Olive is powered by a large language model connected to Woolworths' product data*. In theory, it should know what things cost.
What the errors show is a gap between what the AI was serving up and what was actually true in the system. The AI had no way of knowing the difference.
The architecture behind most systems like this has a name. It’s called "RAG" – Retrieval Augmented Generation. And it's built into the way many business tools connect AI to your data. It’s also one of the most common failure modes in enterprise AI.
It works when data is clean, bounded and relatively static. Most business data is none of those things.
What is RAG, and why does it matter?
RAG is a technique that lets an AI search your documents before answering a question, so the response draws on your data rather than just its general training.
A standard AI model (the kind powering ChatGPT and similar tools) only knows what it was trained on: a vast sweep of internet content, frozen at a point in time. It knows nothing specific about your business.
RAG is supposed to fix that. It isn't a product you buy or a setting you switch on. It's a technique built into most AI tools that connect to your business data.
When your AI searches your documents before answering a question, that's RAG at work. The answer is supposed to come from your data, not from general training.
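In skeletal form, the retrieve-then-generate loop looks something like this. A minimal sketch: keyword overlap stands in for the vector search most real systems use, and the document set and prompt template are illustrative, not from any actual product.

```python
# Minimal RAG sketch: retrieve the most relevant document, then build a
# prompt that asks the model to answer from that document alone.

DOCUMENTS = {
    "refund_policy": "Refunds are available within 30 days with proof of purchase.",
    "pricing": "The standard plan is $49 per month, billed annually.",
    "onboarding": "New hires complete security training in their first week.",
}

def retrieve(question: str) -> str:
    """Score each document by keyword overlap and return the best match.
    Real systems use vector embeddings; the principle is the same."""
    q_words = set(question.lower().split())
    return max(DOCUMENTS.values(),
               key=lambda doc: len(q_words & set(doc.lower().split())))

def build_prompt(question: str) -> str:
    """Put the retrieved text in front of the model instead of
    relying on its training data."""
    context = retrieve(question)
    return (f"Answer using ONLY this context:\n{context}\n\n"
            f"Question: {question}")

print(build_prompt("What is the standard plan price per month?"))
```

Everything downstream of this sketch depends on `retrieve` surfacing the right document; when it doesn't, the model answers fluently from the wrong context.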
In theory: your sales team (and chatbot) gets current pricing. Your new hire gets the right onboarding process. Your finance team gets accurate invoice summaries. Your customers get the actual refund policy.
That's the theory. It works well with a small, tidy set of documents. The problem is that real businesses aren't small or tidy, and the gap between what RAG promises and what it delivers opens up fast.
Why does RAG fail at production scale?
The architecture was designed for small, fixed document sets. Real businesses are neither.
Three problems compound each other once you move beyond that. The first is that AI models can hallucinate even when they find the right document.
A model's own ingrained knowledge sometimes overwrites what it just retrieved. It finds the right source, then ignores it, reverting to what it already believed. That's how you get a confident, plausible-sounding answer about a policy that was updated three months ago.
The second is a scale problem. Most RAG systems are tested against tens of thousands of documents. A mid-market company generates hundreds of thousands a year. A single business software system can produce millions of transaction records.
When the document library grows past a certain size, the AI's ability to distinguish relevant from irrelevant breaks down. Everything looks equally relevant. The system can no longer reliably surface what you actually need.
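The failure mode is easy to reproduce in miniature. In the hypothetical below (again using keyword overlap as a stand-in for embedding similarity), one current policy sits among a thousand superseded near-copies, and every one of them scores identically against the question:

```python
# Relevance collapse in miniature: with many near-duplicate documents,
# the retriever's scores tie at the top and the "best" match is arbitrary.

def score(question: str, doc: str) -> int:
    """Keyword-overlap score standing in for embedding similarity."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

# One current policy buried among 1,000 superseded near-copies.
corpus = ["refund policy: refunds allowed within 30 days"] + [
    f"refund policy: refunds allowed within 30 days (superseded draft v{i})"
    for i in range(1000)
]

question = "what is the refund policy for refunds"
scores = [score(question, d) for d in corpus]

# Every document shares the top score: the retriever cannot tell the
# current policy from its outdated near-duplicates.
ties = scores.count(max(scores))
print(f"{ties} documents tie at the top score")
```

Real embeddings are subtler than word overlap, but the dynamic is the same: as near-duplicates accumulate, the score gap between the right document and the rest shrinks toward zero.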
The third is the result: over 80% of AI projects fail, twice the rate of non-AI technology projects. 42% of enterprises abandoned most of their AI initiatives in 2025, up from 17% the year before. These aren't teething problems.
They're the architecture doing exactly what it was designed to do, just at a scale it was never designed for.
Does a bigger, faster model fix it?
It helps. It doesn't fix the underlying problem.
One response has been to build AI with larger working memories. Modern models can hold the equivalent of several novels' worth of text at once, and feeding entire documents directly into that memory often produces better results than retrieval alone. That's genuine progress.
But the deeper issue remains. AI models generate responses by predicting the most likely next word, based on everything they've been trained on. That's what makes them fluent. It's also what makes them unpredictable.
More memory and faster processing doesn't change that underlying behaviour. For a consumer chatbot, some variance is acceptable. For a business processing invoices, advising customers, or making operational decisions, it isn't.
What actually works at enterprise scale?
The architecture that holds up at scale is orchestration: breaking business processes into defined steps and applying AI within a structure that keeps it on track.
Every approach that works in the research follows the same pattern. Give AI a defined structure to operate within, rather than asking it to generate the right answer on its own.
Instead of asking "given these documents, what would you say?", a well-designed system says: complete this specific task, in this order, with these guardrails, produce output in this format. The AI handles the reasoning. The structure handles the reliability.
In practice this means designing workflows where each step has a clear input, a defined process and a predictable output. AI handles the reasoning at each step. It isn't asked to manage the whole chain on its own. The result is something you can check, repeat and trust.
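One way to sketch that pattern: each step declares its task and a validation check, and no output propagates until it passes. Everything here is illustrative; `call_model` is a hypothetical stand-in for whatever LLM API a real system would use.

```python
# Orchestration sketch: each step has a defined input, a defined task,
# and an output checked against a format rule before the next step runs.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    prompt_template: str
    validate: Callable[[str], bool]  # rejects malformed output early

def call_model(prompt: str) -> str:
    """Hypothetical stand-in: a real system would call an LLM here."""
    return "TOTAL=49.00"

def run_workflow(steps: list[Step], data: str) -> str:
    for step in steps:
        output = call_model(step.prompt_template.format(input=data))
        if not step.validate(output):
            raise ValueError(f"Step {step.name!r} produced invalid output: {output!r}")
        data = output  # each step's checked output feeds the next
    return data

extract_total = Step(
    name="extract_invoice_total",
    prompt_template="Extract the total as TOTAL=<amount> from: {input}",
    validate=lambda out: out.startswith("TOTAL=")
        and out.removeprefix("TOTAL=").replace(".", "", 1).isdigit(),
)

result = run_workflow([extract_total], "Invoice #1042, amount due $49.00")
print(result)
```

The model still does the reasoning inside each step, but a malformed answer fails loudly at the boundary instead of flowing silently into the next decision, which is what makes the chain checkable and repeatable.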
Many teams who set out to fix RAG's reliability have ended up building exactly this. Retrieval is still there, but it's one tool the system uses, not the whole system. The structure does the work.
It's why the future of enterprise AI looks less like a freelance assistant and more like an operating system: defined processes, clear accountability, outputs you can rely on. We've written more about why this distinction matters for businesses building on AI.
Is your AI operating inside a structure?
If your business has run an agentic AI pilot that showed promise and stalled, it's worth examining the architecture before you iterate on the model or the data.
Two questions are worth sitting with. Is the AI operating inside a structured workflow with clear constraints? And when it produces an output, is there a record your team can inspect and learn from?
AI operating inside a structure scales. AI operating without one has a ceiling. The research now explains precisely where that ceiling sits.

*Woolworths has not publicly confirmed whether Olive uses RAG. Any reference to RAG here is illustrative of a common architectural pattern in enterprise AI, not a claim about Woolworths’ specific implementation.
This post is based on original research and analysis by Decidr Co-CEO David Brudenell. Read the full article here.
See how Decidr builds orchestrated AI workflows that work at production scale.


