
Production RAG: Why Retrieval Fails and How to Fix It

By Mudassir Khan — Agentic AI Consultant & AI Systems Architect, Islamabad, Pakistan


Section 01 · The Problem

Why most RAG pipelines fail in production

The failure is almost never in generation. When a RAG system gives a wrong, hallucinated, or incomplete answer, the root cause is usually retrieval — the system fetched the wrong chunks, or none at all.

Quick answer

A production RAG pipeline fails when the retriever returns irrelevant or incomplete context. The generator then has nothing correct to work from, so it either hallucinates or hedges. Fix retrieval first.

In 2026, naive RAG — fixed-size chunking plus single-vector similarity search — fails to retrieve the correct context roughly 40% of the time. That number climbs as document collections grow and queries become more specific. The generator is doing its job. The retriever is not giving it the material it needs.

There are four root causes. Each has a corresponding fix, and the fixes are ordered by return on investment. Start at the top.

Four root causes of RAG retrieval failure: wrong chunk boundaries, missing keyword recall, no reranking, and retrieval without confidence scoring.
The four failure modes appear at different stages of the retrieval pipeline. Most teams encounter them in the order shown.

Section 02 · Chunking

Stop splitting by character count

Chunking strategy constrains retrieval accuracy more than embedding model choice. A 2025 clinical study found adaptive chunking achieved 87% retrieval accuracy versus 13% for fixed-size baselines on the same dataset.

Fixed-size chunking — splitting every 512 or 1024 characters regardless of content — cuts sentences mid-thought, separates questions from their answers, and drops the context that makes a passage meaningful. The embedding model encodes an incomplete idea. The similarity score is lower than it should be. The retriever misses.

Semantic chunking

Uses embedding similarity to detect topic boundaries. When the cosine distance between adjacent sentences crosses a threshold, the chunker starts a new chunk. Each chunk contains one coherent idea. This is the practical default for most RAG systems in 2026.
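
A minimal sketch of that idea in Python, assuming the sentence-transformers package; the model name and the 0.35 distance threshold are illustrative placeholders, not tuned recommendations:

```python
# Minimal semantic chunker: start a new chunk when the cosine distance
# between adjacent sentence embeddings crosses a threshold.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_chunks(sentences: list[str], threshold: float = 0.35) -> list[str]:
    if not sentences:
        return []
    # Normalized embeddings make the dot product equal to cosine similarity.
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        distance = 1.0 - float(np.dot(embeddings[i - 1], embeddings[i]))
        if distance > threshold:  # topic boundary detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```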

Proposition chunking

Decomposes documents into atomic factual claims, each expressing exactly one verifiable statement. This is the highest-precision approach for knowledge-intensive applications like legal research and medical QA, where retrieval of a single misattributed fact is unacceptable.
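
Proposition extraction is usually prompt-driven. The sketch below assumes a generic text-in, text-out LLM call (llm_complete) and an illustrative prompt; neither is a fixed standard.

```python
# Sketch of prompt-driven proposition chunking. llm_complete stands in
# for any LLM completion call; the prompt wording is illustrative.
PROPOSITION_PROMPT = (
    "Decompose the passage into standalone factual claims, one per line. "
    "Each claim must be verifiable on its own, with pronouns resolved.\n\n"
    "Passage: {passage}"
)

def proposition_chunks(passage: str, llm_complete) -> list[str]:
    raw = llm_complete(PROPOSITION_PROMPT.format(passage=passage))
    # Each non-empty line becomes one atomic, individually indexable chunk.
    return [line.strip() for line in raw.splitlines() if line.strip()]
```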

Hierarchical chunking

Maintains both a summary chunk and its constituent child chunks. At query time the system retrieves the summary for context and the child chunk for precision. Works well for long documents where section-level context is needed to interpret paragraph-level content.
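
One minimal way to represent the parent/child structure, with an assumed summarize() call standing in for whatever summarizer you use; the dataclass shape is illustrative, not a fixed schema:

```python
# Parent/child representation for hierarchical chunking.
from dataclasses import dataclass

@dataclass
class SectionChunks:
    title: str
    summary: str         # indexed for section-level context
    children: list[str]  # indexed for paragraph-level precision

def build_hierarchy(sections: dict[str, list[str]], summarize) -> list[SectionChunks]:
    return [
        SectionChunks(title=t, summary=summarize(" ".join(paras)), children=paras)
        for t, paras in sections.items()
    ]
```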

Whichever strategy you choose, validate with recall metrics on a sample query set before deploying. Chunking quality is invisible until you measure it.
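
A sketch of that validation, assuming a labeled sample of queries mapped to the IDs of chunks they should surface, and your own retrieve() function returning objects with an .id attribute:

```python
# Average recall@k over a labeled sample query set.
def recall_at_k(labeled_queries: dict[str, set[str]], retrieve, k: int = 5) -> float:
    scores = []
    for query, relevant_ids in labeled_queries.items():
        retrieved_ids = {chunk.id for chunk in retrieve(query, k=k)}
        # Fraction of the relevant chunks that made it into the top k.
        scores.append(len(retrieved_ids & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores)
```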

Section 04 · Evaluation

RAGAS: the five numbers that matter in production

RAGAS provides reference-free evaluation metrics you can run on live traffic without human annotation. These five metrics cover the full retrieval-to-answer pipeline.

RAGAS production metrics — target values for a reliable RAG system

Metric | What it measures | Production target
Faithfulness | Does the answer contain only claims supported by the retrieved context? | Above 0.90
Answer relevancy | Does the answer address what the question asked? | Above 0.85
Context precision | Are the retrieved chunks actually relevant to the question? | Above 0.80
Context recall | Did retrieval surface all the information needed to answer? | Above 0.75
Answer correctness | Is the answer factually correct compared to ground truth? | Above 0.80

Faithfulness is the most important metric for production safety. A faithfulness score below 0.85 means the model is regularly generating claims not supported by what it retrieved — that is hallucination by definition. Fix retrieval or increase top-k before deploying.

Run RAGAS evaluations asynchronously on a sample of production traffic, not inline with user requests. Blocking the response pipeline on evaluation adds latency and gains nothing for the user. Collect, evaluate overnight, alert on threshold breaches.
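
A minimal batch-evaluation sketch using the ragas Python package; this follows the classic datasets-based interface, and exact imports and column names vary by version. Note that ragas calls an LLM judge internally, so judge credentials are assumed to be configured in the environment.

```python
# Overnight batch evaluation on sampled production traffic.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Hypothetical rows pulled from production logs.
sample = Dataset.from_dict({
    "question": ["What is the refund window for annual plans?"],
    "answer": ["Annual plans can be refunded within 30 days of purchase."],
    "contexts": [["Refunds: annual subscriptions are refundable for 30 days."]],
    "ground_truth": ["Annual plans are refundable within 30 days."],
})

scores = evaluate(
    sample,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # alert if faithfulness drops below your threshold, e.g. 0.90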

Section 05 · Architecture

Adaptive RAG: the 2026 architecture standard

Adaptive RAG classifies each incoming query before retrieval and routes it to the appropriate strategy. It is the architecture that separates production systems from prototypes.

A naive RAG system treats every query identically: retrieve, then generate. Adaptive RAG adds a classification step at the front. Simple factual queries route to fast vector search. Complex multistep queries route to iterative or hierarchical retrieval. Queries outside the knowledge base route directly to the model's parametric knowledge, skipping retrieval entirely.

The routing logic is usually a small LLM call or a classifier. The cost is low — a few milliseconds and a few tokens — and the accuracy gain is significant. Systems that skip retrieval when retrieval confidence is low produce far fewer hallucinations than systems that always retrieve and pass low-quality context.
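
In code, the routing step can be as small as the skeleton below. The route labels, the 0.5 confidence cutoff, and the injected callables are illustrative assumptions, not a fixed standard.

```python
# Skeleton of the adaptive routing step.
from typing import Callable, Optional

def route_and_answer(
    query: str,
    classify: Callable[[str], str],             # small LLM call or classifier
    fast_retrieve: Callable[[str], list[str]],  # single-pass vector search
    deep_retrieve: Callable[[str], list[str]],  # iterative / hierarchical retrieval
    confidence: Callable[[list[str]], float],   # retrieval confidence score
    generate: Callable[[str, Optional[list[str]]], str],
) -> str:
    route = classify(query)  # "simple" | "complex" | "out_of_kb"
    if route == "out_of_kb":
        return generate(query, None)  # parametric knowledge only, skip retrieval
    chunks = fast_retrieve(query) if route == "simple" else deep_retrieve(query)
    if confidence(chunks) < 0.5:
        return generate(query, None)  # gate: never pass low-quality context
    return generate(query, chunks)
```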

Adaptive RAG flow: query classifier routes to fast retrieval, iterative retrieval, or direct generation based on query type and retrieval confidence.
Adaptive RAG routes each query to the appropriate retrieval strategy. The confidence check before generation is the feature that prevents low-quality context from reaching the model.

If you are building a new RAG system in 2026, design for adaptive routing from the start. Adding it later requires restructuring the retrieval pipeline, not just wrapping it.

For production agentic AI systems that use RAG as a memory or knowledge layer, see my agentic AI consulting service for how retrieval fits into a broader agentic architecture.

Section 06 · Cost

What RAG costs per query at different complexity levels

The upgrade path has a real cost. Here is what to budget as you move from naive to adaptive.

Cost per query estimates across RAG complexity levels (2026)

Architecture | Typical cost per query | Quality ceiling
Naive vector only | $0.0005 to $0.002 | Moderate — fails on exact match and multi-concept queries
Hybrid search + reranker | $0.002 to $0.008 | Good — handles most production query types
Adaptive RAG with routing | $0.005 to $0.015 | High — near-ceiling for retrieval-based systems
Agentic RAG (iterative) | $0.02 to $0.10 | Very high — for research-grade and analyst workflows

FAQ

Frequently asked questions

Why does RAG fail even when the chunks look correct?

Chunk content and retrieval ranking are separate problems. A chunk may contain the right information but rank below the top-k cutoff because its embedding similarity is lower than that of irrelevant but superficially similar chunks. The fix is a reranker that re-scores based on the actual question-chunk relationship, not just embedding proximity.

What is the difference between semantic chunking and fixed-size chunking?

Fixed-size chunking splits every N characters regardless of content, frequently cutting sentences or ideas in half. Semantic chunking uses embedding similarity between adjacent sentences to detect topic boundaries, keeping coherent ideas together in a single chunk. Semantic chunking consistently outperforms fixed-size chunking on retrieval accuracy benchmarks.

How much does adding a reranker improve RAG quality?

A cross-encoder reranker reliably moves the correct chunk from position 8 or 12 into the top 3, which is all the language model sees. Teams who add reranking to an existing hybrid search pipeline typically see 20 to 40 percent improvement in faithfulness scores without changing any other component.
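
For reference, a cross-encoder reranking step is only a few lines with sentence-transformers; the checkpoint name here is one common public reranker, not a specific recommendation:

```python
# Cross-encoder reranking: score each (query, chunk) pair jointly,
# then keep the best top_n for the generator.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
```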

What RAGAS score should I target before going to production?

Faithfulness above 0.90, answer relevancy above 0.85. If either metric is below those thresholds on a representative sample of production queries, diagnose the failure before shipping. Below 0.85 faithfulness in production means roughly 1 in 7 responses contains a hallucinated claim.

When should I use adaptive RAG versus standard RAG?

Use adaptive RAG when your query set is heterogeneous — some queries need fast retrieval, some need iterative search, and some are outside your knowledge base entirely. If every query is similar in nature and your knowledge base is well-bounded, standard hybrid RAG with reranking is sufficient.

Written by Mudassir Khan

Agentic AI consultant and AI systems architect based in Islamabad, Pakistan. CEO of Cube A Cloud. 38+ agentic AI launches delivered for global founders and CTOs.


Need an AI systems architect?

Book a 30-minute architecture call. I will sketch the high-level design for your use case and give you an honest view of the trade-offs.

Book a strategy call →