
RAG Evaluation: Metrics That Actually Matter in Production

By Mudassir Khan — Agentic AI Consultant & AI Systems Architect, Islamabad, Pakistan


Quick answer

What is RAG evaluation in one sentence? RAG evaluation measures whether a retrieval augmented generation system is retrieving the right content and generating responses that accurately reflect it, across four production metrics: retrieval recall, answer faithfulness, groundedness, and latency relative to the quality those metrics reveal.

Section 01 · The problem

Why “it seems to work” fails at scale

Every RAG system seems to work during demos. The questions are curated, the documents are fresh, and the evaluator is the person who built the system. Eyeballing outputs in a dev environment is not a quality signal; it is an exercise in confirmation bias.

The failure mode that kills RAG systems in production is subtler than obvious hallucination. It usually looks like this: the system retrieves chunks that are plausibly related to the query but not the most relevant ones. The generator, seeing context that is close but not exact, produces a confident answer that sounds correct but is factually off by one detail. A user trusts the answer. The error propagates downstream. Nobody catches it until a user or a compliance review surfaces the problem, by which point the system has served thousands of similarly flawed responses.

A systematic evaluation pipeline catches this before it reaches users. The production RAG guide covers the full architecture of a production RAG system; this guide focuses specifically on the evaluation layer that tells you whether that architecture is working.

Section 02 · Metric 1

Retrieval recall, the metric that catches silent failures

Retrieval recall measures whether the chunks your system needs to answer a question were actually retrieved. Of all the relevant chunks that exist in your corpus for a given query, what fraction did your retriever return?

What low recall actually means

A retrieval recall of 60 percent means that, averaged over your test queries, 40 percent of the ground truth relevant chunks never reach the context window. For those queries the generator produces an answer from whatever was retrieved, which might be adjacent, plausible, and wrong. Low retrieval recall is the most common root cause of RAG system failures, and the hardest to detect without measuring it explicitly.

How to measure it

You need a golden dataset: a set of question and relevant chunk pairs where you have manually verified which chunks are ground truth relevant for each question. For each question in your golden set, check whether your retriever actually returned those ground truth chunks within its top k results.
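The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the chunk IDs, and the dictionary layout keyed by question ID are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of ground truth chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("each golden question needs at least one relevant chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    results: Dict[str, List[str]], golden: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every question in the golden set."""
    scores = [
        recall_at_k(results.get(question_id, []), relevant, k)
        for question_id, relevant in golden.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical golden set and retriever output, keyed by question ID.
golden = {"q1": {"c1", "c2"}, "q2": {"c7"}}
results = {"q1": ["c1", "c9", "c2"], "q2": ["c3", "c4", "c5"]}

print(mean_recall_at_k(results, golden, k=3))  # q1 scores 1.0, q2 scores 0.0 -> 0.5
```

A question the retriever returned nothing for counts as zero recall rather than being skipped, so missing results drag the average down instead of hiding.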

Production targets

A production RAG system should target retrieval recall above 85 percent. Below 70 percent, the retriever is the primary problem, and reranking or embedding model changes are the right intervention.

Common causes of low retrieval recall

Embedding model mismatch (your embedding model does not represent your domain well), poor chunking strategy (relevant information is split across chunk boundaries), metadata filtering bugs (filters are excluding relevant documents), or a corpus indexing problem (not all documents are indexed).

Section 03 · Metric 2

Answer faithfulness, the test for context use

Answer faithfulness measures whether every claim in the generated response is directly supported by the retrieved context. An answer is unfaithful if it makes claims the context does not support, even if those claims happen to be factually correct by external knowledge.

The distinction matters. A RAG system that draws on the model's parametric knowledge rather than the retrieved context is not behaving as designed. You cannot control what the model knows from pretraining. You can control what context you provide. Faithfulness measures whether the generator is actually using that context.

RAGAS measures faithfulness by decomposing the generated answer into individual claims and checking each claim against the retrieved context using a judge model. A claim that cannot be traced to the context reduces the faithfulness score. A perfect faithfulness score of 1.0 means every claim in the answer can be verified from the retrieved chunks.
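The scoring step reduces to supported claims divided by total claims. The sketch below shows only that arithmetic and is not RAGAS code. It assumes the answer has already been decomposed into claims, and it swaps the judge model for a toy substring check (`substring_judge`, a hypothetical stand-in) so the example runs without any API access.

```python
from typing import Callable, List

# A judge takes (claim, context) and returns True when the context supports the claim.
Judge = Callable[[str, str], bool]


def faithfulness_score(claims: List[str], context: str, judge: Judge) -> float:
    """Share of claims the judge considers supported by the retrieved context."""
    if not claims:
        raise ValueError("the answer must contain at least one claim")
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)


def substring_judge(claim: str, context: str) -> bool:
    """Toy stand-in for a judge model: case-insensitive exact substring match."""
    return claim.lower() in context.lower()


context = (
    "The refund window is 30 days. "
    "Refunds are issued to the original payment method."
)
claims = [
    "The refund window is 30 days",
    "Refunds are issued to the original payment method",
    "Refunds take 5 business days",  # not in the context, so unsupported
]

score = faithfulness_score(claims, context, substring_judge)
print(round(score, 2))  # 2 of 3 claims supported -> 0.67
```

In a real pipeline the judge would be a model call, and both the claim decomposition and the support check inherit that model's errors.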

Practical faithfulness targets by domain risk profile.

| Domain | Target score | Intervention threshold |
| --- | --- | --- |
| Low stakes (internal tools, content) | Above 0.85 | Below 0.75 |
| Customer facing, general SaaS | Above 0.88 | Below 0.80 |
| Regulated or high stakes (fintech, health, legal) | Above 0.92 | Below 0.85 |

Below 0.75, the generator is frequently making things up beyond what the context supports, and prompt engineering to reinforce “only answer from context” is the first intervention.
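One way to make these targets actionable is to encode them as data and classify each evaluation run against them. The sketch below uses the thresholds stated in this section. The domain keys, the function name, and the middle "watch" band for scores between the two thresholds are assumptions of this example.

```python
from typing import Dict, Tuple

# (target, intervention threshold) per domain risk profile.
FAITHFULNESS_TARGETS: Dict[str, Tuple[float, float]] = {
    "low_stakes": (0.85, 0.75),
    "customer_facing": (0.88, 0.80),
    "regulated": (0.92, 0.85),
}


def faithfulness_status(domain: str, score: float) -> str:
    """Classify a faithfulness score as 'on_target', 'watch', or 'intervene'."""
    if domain not in FAITHFULNESS_TARGETS:
        raise KeyError(f"unknown domain profile: {domain}")
    target, floor = FAITHFULNESS_TARGETS[domain]
    if score > target:
        return "on_target"
    if score < floor:
        return "intervene"
    # Between the two thresholds: not failing, but not at target either.
    return "watch"


print(faithfulness_status("customer_facing", 0.90))  # on_target
print(faithfulness_status("regulated", 0.90))        # watch
print(faithfulness_status("low_stakes", 0.70))       # intervene
```

The same score of 0.90 passes for a general SaaS product and falls short for a regulated one, which is the point of setting targets by risk profile.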

Section 04 · Metric 3

Groundedness, the answer relevance check

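Groundedness, as this guide uses the term, measures whether the generated answer actually addresses the question asked; RAGAS calls this metric answer relevancy. (Some other tools use "groundedness" for what this guide calls faithfulness, so check the definition before comparing scores across tools.) It is possible for an answer to be perfectly faithful to the retrieved context while still failing to answer the question, if the retrieved context is itself not relevant.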

This metric catches a specific failure mode: the retriever returns chunks that are topically related but not actually relevant to the specific question. The generator faithfully summarizes those chunks and produces an answer that is confidently off topic.

A simple way to measure groundedness: given the generated answer, could you reconstruct the original question? If yes, the answer is grounded. RAGAS implements this by having a judge model reverse engineer what question the answer is responding to, then measuring the semantic similarity between that reconstructed question and the original.
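The similarity step can be illustrated as follows. This is a simplified sketch, not the RAGAS implementation: it assumes the judge model has already produced the reconstructed questions, it averages over several reconstructions (an assumption of this example), and it substitutes a bag-of-words cosine for the embedding similarity a real pipeline would use.

```python
import math
import re
from collections import Counter
from typing import List


def bow_vector(text: str) -> Counter:
    """Lowercased bag-of-words counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance_score(original_question: str, reconstructed: List[str]) -> float:
    """Mean similarity between the original and the reconstructed questions."""
    if not reconstructed:
        raise ValueError("need at least one reconstructed question")
    original = bow_vector(original_question)
    sims = [cosine_similarity(original, bow_vector(q)) for q in reconstructed]
    return sum(sims) / len(sims)


question = "How long is the refund window?"
on_topic = relevance_score(question, ["What is the refund window length?"])
off_topic = relevance_score(question, ["Which payment methods are accepted?"])

print(on_topic > off_topic)  # True
```

Word overlap misses paraphrases that embeddings would catch, so treat the numbers from this toy version as illustrative only.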

Where groundedness usually breaks

Low groundedness is usually a retrieval problem (the wrong context is coming back) or a query formulation problem (the question is ambiguous enough that the retriever cannot find the right content). Improving groundedness typically requires improving the retrieval step, not the generator.

Section 05 · Tooling

The RAGAS framework: what it does well and what it does not

RAGAS (Retrieval Augmented Generation Assessment) is the most widely used open source library for automating RAG evaluation. It measures faithfulness, answer relevance (groundedness), context precision, and context recall using a judge model (an OpenAI model by default; other providers, such as Claude, can be configured).

What RAGAS does well

It automates the measurement of four core metrics against a golden dataset, produces numerical scores you can track over time, and integrates with LangChain and LlamaIndex pipelines. For teams that want systematic evaluation without building it from scratch, it is the fastest path to a production evaluation pipeline.

What RAGAS does not do

It does not define what “correct” is for your domain. The judge model brings its own biases and errors. RAGAS scores above a threshold do not guarantee that your system is performing correctly. They only indicate that it is performing consistently against whatever your judge model considers correct. For high stakes domains, human review of a sample of outputs should always accompany RAGAS scoring.

The hidden requirement

RAGAS requires a question and context dataset, not just questions. Building this golden dataset is the hard part, and the part most teams skip by using synthetic data generation instead of real queries.

Section 06 · Ground truth

Building a golden evaluation dataset

A golden dataset is a set of question, ground truth relevant chunks, and expected answer triples that you have manually verified. It is the ground truth against which your automated metrics are measured.

The fastest way to build one that is actually useful: start with 50 to 100 real user queries from production (if you have them) or representative synthetic queries. For each query, manually identify which chunks in your corpus are the ground truth relevant chunks. This is the retrieval ground truth. Then write or verify an expected answer for each query.
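A golden dataset entry can be kept as a small validated record and stored as JSON Lines so it is easy to review and version. The sketch below is one possible shape, not a required format. The class name, field names, and the `source` label distinguishing production from synthetic queries are assumptions of this example.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class GoldenExample:
    question: str
    relevant_chunk_ids: List[str]   # retrieval ground truth, manually verified
    expected_answer: str            # written or verified by a human
    source: str = "production"      # "production" or "synthetic"

    def __post_init__(self) -> None:
        if not self.question.strip():
            raise ValueError("question must not be empty")
        if not self.relevant_chunk_ids:
            raise ValueError("at least one ground truth chunk is required")
        if self.source not in {"production", "synthetic"}:
            raise ValueError("source must be 'production' or 'synthetic'")


def to_jsonl(examples: List[GoldenExample]) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(example)) for example in examples)


def from_jsonl(text: str) -> List[GoldenExample]:
    """Parse JSON Lines back into validated examples."""
    return [GoldenExample(**json.loads(line)) for line in text.splitlines() if line.strip()]


examples = [
    GoldenExample(
        question="How long is the refund window?",
        relevant_chunk_ids=["policy-refunds-02"],
        expected_answer="Refunds are accepted within 30 days of purchase.",
    )
]
assert from_jsonl(to_jsonl(examples)) == examples
```

Recording the source of each query makes it possible to report metrics separately for real and synthetic questions.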

Golden dataset sizing by intended use.

| Dataset size | What it gives you | When to expand |
| --- | --- | --- |
| 50 examples | Meaningful recall measurement, faithfulness calibration | Before any production launch |
| 100 examples | Coverage of most query patterns for a focused domain | When new query types appear in logs |
| 200+ examples | Diminishing returns unless query distribution is very diverse | Only if multiple distinct user segments |

The key principle: use real queries whenever possible. Synthetic query generation (having a model generate questions from your documents) produces a dataset that is biased toward what the documents discuss rather than what users actually ask. The mismatch is often significant.

The factuality and faithfulness scores you get from RAGAS become more informative as your golden dataset grows in diversity, especially when it includes cases where the expected answer is phrased differently from, or goes beyond, what the context states explicitly. Start small and expand the dataset based on which query patterns reveal measurement gaps.

Section 07 · Diagnosis

Is the retriever or the generator the problem?

When a RAG system produces a bad output, the diagnosis usually comes down to one question: was the right context retrieved? The three-way diagnostic below saves significant debugging time.

High recall, low faithfulness: the generator is the problem

The context was there, but the model did not use it correctly. Interventions: prompt engineering with stronger instructions to stay within context, a model upgrade, or output filtering. The retriever is doing its job; the generator is freelancing.

Low recall: the retriever is the problem

Everything downstream is operating on insufficient context. The generator might produce a fluent, confident wrong answer, not because it is hallucinating maliciously, but because it is doing the best it can with the context it received. Interventions: reranking, embedding model change, chunking strategy revision, or query expansion.

Acceptable recall, low groundedness: the chunks look right but do not answer the question

The retrieved chunks are on topic but not specific enough to answer the question. Interventions: finer grained chunking, metadata based filtering to narrow context to the most relevant document, or hybrid retrieval that combines semantic search with keyword search.

Rather than guessing whether to fix the prompt or the retriever, you measure which metric is failing and target the intervention appropriately. The AI systems architecture service I run includes building this evaluation pipeline as a standard part of production RAG engagements, because the diagnostic layer is what makes the system maintainable rather than a black box.
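The three cases in this section can be expressed as a small decision function. This is a sketch under stated assumptions: the recall and faithfulness cutoffs reuse the targets from earlier sections, the groundedness cutoff of 0.80 is a hypothetical value that this guide does not specify, and the returned labels are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    recall: float = 0.85         # production recall target from this guide
    faithfulness: float = 0.85   # low stakes faithfulness target from this guide
    groundedness: float = 0.80   # hypothetical; pick a value for your own domain


def diagnose(
    recall: float,
    faithfulness: float,
    groundedness: float,
    thresholds: Thresholds = Thresholds(),
) -> str:
    """Return which layer to investigate first, checking retrieval before generation."""
    if recall < thresholds.recall:
        return "retriever"            # downstream steps lack the right context
    if faithfulness < thresholds.faithfulness:
        return "generator"            # context was there but was not used
    if groundedness < thresholds.groundedness:
        return "context_specificity"  # chunks on topic but not specific enough
    return "healthy"


print(diagnose(recall=0.62, faithfulness=0.90, groundedness=0.85))  # retriever
print(diagnose(recall=0.90, faithfulness=0.70, groundedness=0.85))  # generator
print(diagnose(recall=0.90, faithfulness=0.92, groundedness=0.60))  # context_specificity
```

Recall is checked first because a faithfulness or groundedness score measured on the wrong context says little about the generator.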

Section 08 · FAQ

Frequently asked questions

The questions teams ask most before building out a RAG evaluation pipeline.

What is RAG evaluation?

RAG evaluation is the systematic measurement of whether a retrieval augmented generation system is retrieving the right content and generating responses that accurately and completely reflect it. The four core metrics are retrieval recall, answer faithfulness, groundedness, and latency relative to the quality those metrics reveal.

How do you evaluate a RAG system?

Start with a golden evaluation dataset: a set of question and relevant chunk pairs you have manually verified. Measure retrieval recall by checking whether the ground truth chunks appear in the retrieved results. Measure faithfulness and groundedness using the RAGAS library or a custom judge model evaluation. Run the evaluation on a cadence, not just once at launch, so you detect quality regressions as your corpus or retriever changes.
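Running the evaluation on a cadence is most useful when each run is compared against a stored baseline. The following sketch flags any metric that has dropped by more than a tolerance. The metric names, the scores, and the 0.02 tolerance are illustrative assumptions, not recommended values.

```python
from typing import Dict


def find_regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.02
) -> Dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`, with the drop size."""
    regressions: Dict[str, float] = {}
    for metric, baseline_score in baseline.items():
        if metric not in current:
            raise KeyError(f"current run is missing metric: {metric}")
        drop = baseline_score - current[metric]
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical scores from two evaluation runs on the same golden dataset.
baseline = {"retrieval_recall": 0.88, "faithfulness": 0.91, "groundedness": 0.86}
current = {"retrieval_recall": 0.81, "faithfulness": 0.90, "groundedness": 0.86}

print(find_regressions(baseline, current))  # {'retrieval_recall': 0.07}
```

A check like this can run after every corpus reindex or retriever change and fail the deployment when the returned dictionary is not empty.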

What is the RAGAS framework?

RAGAS (Retrieval Augmented Generation Assessment) is an open source Python library for automating RAG evaluation. It measures faithfulness, answer relevance, context precision, and context recall using a judge model. It integrates with LangChain and LlamaIndex and is the fastest way to build an automated evaluation pipeline for a RAG system without writing the scoring logic from scratch.

What is answer faithfulness in RAG?

Answer faithfulness measures whether every claim in a generated response is directly supported by the retrieved context. An unfaithful answer makes claims the context does not support, drawing on the model's parametric knowledge rather than the provided context. In a RAG system, faithfulness is a direct measure of whether the generator is actually using the retrieval layer as designed.

How do I improve RAG retrieval accuracy?

Start by measuring retrieval recall against a golden dataset to establish a baseline. Common improvements: switch to a domain specific embedding model, adjust chunk size and overlap to keep relevant information within single chunks, add reranking with a cross encoder to reorder retrieved results, use hybrid retrieval (dense plus sparse) to catch keyword specific queries that semantic search misses, and add metadata filtering to scope retrieval to the right document subsets.
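As one illustration of hybrid retrieval, the ranked lists from a dense and a sparse retriever can be merged with reciprocal rank fusion (RRF). RRF is a technique chosen for this sketch, not one the article prescribes. The chunk IDs are hypothetical, and `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical top-3 results from a dense (embedding) and a sparse (keyword) retriever.
dense = ["c2", "c5", "c1"]
sparse = ["c1", "c2", "c8"]

print(reciprocal_rank_fusion([dense, sparse]))  # ['c2', 'c1', 'c5', 'c8']
```

Chunks that both retrievers return rise to the top, while chunks found by only one retriever stay in the candidate list, which is how a fused list can recover keyword matches that semantic search misses.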

Written by Mudassir Khan

Agentic AI consultant and AI systems architect based in Islamabad, Pakistan. CEO of Cube A Cloud. 38+ agentic AI launches delivered for global founders and CTOs.
