Section 01 · The Problem
Why most RAG pipelines fail in production
The failure is almost never in generation. When a RAG system gives a wrong, hallucinated, or incomplete answer, the root cause is usually retrieval — the system fetched the wrong chunks, or none at all.
Quick answer
A production RAG pipeline fails when the retriever returns irrelevant or incomplete context. The generator then has nothing correct to work from, so it either hallucinates or hedges. Fix retrieval first.
In 2026, naive RAG — fixed-size chunking plus single-vector similarity search — fails to retrieve the correct context roughly 40% of the time. That number climbs as document collections grow and queries become more specific. The generator is doing its job. The retriever is not giving it the material it needs.
There are four root causes. Each has a corresponding fix, and the fixes are ordered by return on investment. Start at the top.
Section 02 · Chunking
Stop splitting by character count
Chunking strategy constrains retrieval accuracy more than embedding model choice. A 2025 clinical study found adaptive chunking achieved 87% retrieval accuracy versus 13% for fixed-size baselines on the same dataset.
Fixed-size chunking — splitting every 512 or 1024 characters regardless of content — cuts sentences mid-thought, separates questions from their answers, and drops the context that makes a passage meaningful. The embedding model encodes an incomplete idea. The similarity score is lower than it should be. The retriever misses.
Semantic chunking
Uses embedding similarity to detect topic boundaries. When the cosine distance between adjacent sentences crosses a threshold, the chunker starts a new chunk. Each chunk contains one coherent idea. This is the practical default for most RAG systems in 2026.
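As a concrete reference, here is a minimal sketch of that boundary-detection loop, assuming sentence-transformers for embeddings; the model name and the 0.3 distance threshold are illustrative choices, not recommendations.

```python
# Illustrative semantic chunker: start a new chunk when adjacent sentences drift apart.
# Assumes sentence-transformers; model name and threshold are example choices.
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunks(sentences: list[str], threshold: float = 0.3) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for prev, curr, sent in zip(embeddings, embeddings[1:], sentences[1:]):
        cosine_distance = 1.0 - float(np.dot(prev, curr))
        if cosine_distance > threshold:  # adjacent sentences diverged: topic boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```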
Proposition chunking
Decomposes documents into atomic factual claims, each expressing exactly one verifiable statement. This is the highest-precision approach for knowledge-intensive applications like legal research and medical QA, where retrieval of a single misattributed fact is unacceptable.
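Propositions are typically produced by an LLM pass over each passage. The sketch below assumes an OpenAI-style client; the model name and prompt wording are placeholders to adapt, not a fixed recipe.

```python
# Illustrative proposition extraction via a single LLM call per passage.
# Assumes the openai client; model and prompt are example choices.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Decompose the passage into a numbered list of atomic, self-contained "
    "factual claims. Each claim must be verifiable on its own.\n\nPassage:\n"
)

def propositions(passage: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT + passage}],
    )
    lines = response.choices[0].message.content.splitlines()
    # Strip the "1. " style numbering the prompt asks for.
    return [line.split(". ", 1)[-1].strip() for line in lines if line.strip()]
```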
Hierarchical chunking
Maintains both a summary chunk and its constituent child chunks. At query time the system retrieves the summary for context and the child chunk for precision. Works well for long documents where section-level context matters to interpret paragraph-level content.
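The bookkeeping is simple: each child chunk points back to its parent summary, and both levels are passed to the generator. The sketch below assumes you already have a vector search over child chunks; `search_children` is a stand-in for it, and the names are illustrative.

```python
# Illustrative parent/child bookkeeping for hierarchical retrieval.
# `search_children` stands in for whatever vector search you already run.
from dataclasses import dataclass, field

@dataclass
class ParentChunk:
    summary: str                                        # section-level summary
    children: list[str] = field(default_factory=list)   # paragraph-level chunks

index: dict[str, ParentChunk] = {}                      # child_id -> parent

def retrieve(query: str, search_children) -> list[str]:
    results = []
    for child_id, child_text in search_children(query, k=5):
        parent = index[child_id]
        # Summary gives section context; the child chunk gives precision.
        results.append(f"[Section summary] {parent.summary}\n[Passage] {child_text}")
    return results
```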
Whichever strategy you choose, validate with recall metrics on a sample query set before deploying. Chunking quality is invisible until you measure it.
Section 03 · Retrieval
Hybrid search and reranking: the two highest-ROI upgrades
Running BM25 and vector search in parallel, then fusing results with Reciprocal Rank Fusion, is the single biggest quality improvement available to a naive RAG pipeline.
Vector search retrieves semantically similar passages — it handles paraphrase and concept matching well but misses exact keyword matches. BM25 handles exact matches and rare terms well but misses semantic relationships. Neither alone is sufficient for a production RAG system that handles varied query types.
Hybrid search runs both in parallel and fuses the ranked lists using Reciprocal Rank Fusion. When both hybrid retrieval and contextual techniques are combined, error rates drop by roughly 69% compared to naive vector-only retrieval. The implementation is straightforward in any production vector store: Weaviate ships hybrid search natively; Pinecone added it in 2025; pgvector requires composing it manually with a BM25 index. To filter the full landscape of vector stores by hybrid search support, hosting model, and price, use the Vector Database Comparison Matrix.
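Reciprocal Rank Fusion itself is only a few lines. The sketch below fuses two ranked lists of document IDs and uses the conventional k = 60 smoothing constant.

```python
# Reciprocal Rank Fusion over ranked result lists (e.g. BM25 and vector search).
# k=60 is the constant from the original RRF paper; document IDs are strings.
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25 and vector results, keep the fused list for reranking.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```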
| Upgrade | Lift | Implementation cost | Priority |
|---|---|---|---|
| Semantic chunking | High | Low | Do first |
| Hybrid search (BM25 + vector) | High | Low to medium | Do second |
| Cross-encoder reranker | High | Medium | Do third |
| Contextual retrieval | Medium | Medium | Do fourth |
| Adaptive RAG routing | Medium to high | High | Do when at scale |
The reranking step deserves its own emphasis. A cross-encoder model re-scores each retrieved chunk against the original query with full attention — it sees both the query and the chunk together, unlike the bi-encoder that scores them separately. A typical production pipeline retrieves top-50 with hybrid search, reranks to top-5 with a cross-encoder, then passes those five chunks to the language model. The cost is modest; the precision improvement is substantial.
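A minimal version of that rerank step, assuming sentence-transformers and the ms-marco cross-encoder checkpoint as an example model:

```python
# Illustrative rerank step: score each candidate chunk against the query
# with a cross-encoder, then keep the top five for the generator.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```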
Section 04 · Evaluation
RAGAS: the five numbers that matter in production
RAGAS provides reference-free evaluation metrics you can run on live traffic without human annotation. These five metrics cover the full retrieval-to-answer pipeline.
| Metric | What it measures | Production target |
|---|---|---|
| Faithfulness | Does the answer contain only claims supported by the retrieved context? | Above 0.90 |
| Answer relevancy | Does the answer address what the question asked? | Above 0.85 |
| Context precision | Are the retrieved chunks actually relevant to the question? | Above 0.80 |
| Context recall | Did retrieval surface all the information needed to answer? | Above 0.75 |
| Answer correctness | Is the answer factually correct compared to ground truth? | Above 0.80 |
Faithfulness is the most important metric for production safety. A faithfulness score below 0.85 means the model is regularly generating claims not supported by what it retrieved — that is hallucination by definition. Fix retrieval or increase top-k before deploying.
Run RAGAS evaluations asynchronously on a sample of production traffic, not inline with user requests. Blocking the response pipeline on evaluation adds latency and gains nothing for the user. Collect, evaluate overnight, alert on threshold breaches.
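A sketch of that nightly job, assuming the ragas and datasets packages; metric names and the expected column layout have shifted between ragas versions, so treat this as a template and check it against the version you install.

```python
# Illustrative nightly RAGAS run over a sample of logged production traffic.
# Assumes ragas' evaluate() and metric objects; column names follow the ragas
# convention and may differ across versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

def nightly_eval(logged_samples: list[dict]):
    # Each logged sample: {"question", "answer", "contexts", "ground_truth"}
    dataset = Dataset.from_list(logged_samples)
    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    print(result)  # dict-like per-metric scores; alert on threshold breaches from here
    return result
```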
Section 05 · Architecture
Adaptive RAG: the 2026 architecture standard
Adaptive RAG classifies each incoming query before retrieval and routes it to the appropriate strategy. It is the architecture that separates production systems from prototypes.
A naive RAG system treats every query identically: retrieve, then generate. Adaptive RAG adds a classification step at the front. Simple factual queries route to fast vector search. Complex multistep queries route to iterative or hierarchical retrieval. Queries outside the knowledge base route directly to the model's parametric knowledge, skipping retrieval entirely.
The routing logic is usually a small LLM call or a classifier. The cost is low — a few milliseconds and a few tokens — and the accuracy gain is significant. Systems that skip retrieval when retrieval confidence is low produce far fewer hallucinations than systems that always retrieve and pass low-quality context.
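As a skeleton, the router can be as small as this; `classify_query` is a placeholder for the LLM call or classifier, and the three routes mirror the strategies described above.

```python
# Illustrative adaptive-RAG router. classify_query is a placeholder for a small
# LLM call or trained classifier; route names follow the text above.
from typing import Callable

def classify_query(query: str) -> str:
    # Placeholder heuristic; in production this is an LLM call or classifier
    # returning one of "simple", "complex", "out_of_scope".
    if "compare" in query.lower() or " and " in query.lower():
        return "complex"
    return "simple"

def answer(query: str, routes: dict[str, Callable[[str], str]]) -> str:
    route = classify_query(query)
    if route == "out_of_scope":
        return routes["no_retrieval"](query)   # parametric knowledge only
    if route == "complex":
        return routes["iterative"](query)      # multi-step or hierarchical retrieval
    return routes["fast_vector"](query)        # default: single hybrid-search pass
```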
If you are building a new RAG system in 2026, design for adaptive routing from the start. Adding it later requires restructuring the retrieval pipeline, not just wrapping it.
For production agentic AI systems that use RAG as a memory or knowledge layer, see my agentic AI consulting service for how retrieval fits into a broader agentic architecture.
Section 06 · Cost
What RAG costs per query at different complexity levels
The upgrade path has a real cost. Here is what to budget as you move from naive to adaptive.
| Architecture | Typical cost per query | Quality ceiling |
|---|---|---|
| Naive vector only | $0.0005 to $0.002 | Moderate — fails on exact match and multi-concept queries |
| Hybrid search + reranker | $0.002 to $0.008 | Good — handles most production query types |
| Adaptive RAG with routing | $0.005 to $0.015 | High — near-ceiling for retrieval-based systems |
| Agentic RAG (iterative) | $0.02 to $0.10 | Very high — for research-grade and analyst workflows |
FAQ
Frequently asked questions
Why does RAG fail even when the chunks look correct?
Chunk content and retrieval ranking are separate problems. A chunk may contain the right information but rank below the top-k cutoff because its embedding similarity is lower than that of irrelevant but superficially similar chunks. The fix is a reranker that re-scores based on the actual question-chunk relationship, not just embedding proximity.
What is the difference between semantic chunking and fixed-size chunking?
Fixed-size chunking splits every N characters regardless of content, frequently cutting sentences or ideas in half. Semantic chunking uses embedding similarity between adjacent sentences to detect topic boundaries, keeping coherent ideas together in a single chunk. Semantic chunking consistently outperforms fixed-size chunking on retrieval accuracy benchmarks.
How much does adding a reranker improve RAG quality?
A cross-encoder reranker reliably moves the correct chunk from position 8 or 12 into the top 3, which is all the language model sees. Teams that add reranking to an existing hybrid search pipeline typically see 20 to 40 percent improvement in faithfulness scores without changing any other component.
What RAGAS score should I target before going to production?
Faithfulness above 0.90, answer relevancy above 0.85. If either metric is below those thresholds on a representative sample of production queries, diagnose the failure before shipping. Below 0.85 faithfulness in production means roughly 1 in 7 responses contains a hallucinated claim.
When should I use adaptive RAG versus standard RAG?
Use adaptive RAG when your query set is heterogeneous — some queries need fast retrieval, some need iterative search, and some are outside your knowledge base entirely. If every query is similar in nature and your knowledge base is well-bounded, standard hybrid RAG with reranking is sufficient.