
How to Evaluate LLM Agents in Production: Beyond Unit Tests

By Mudassir Khan — Agentic AI Consultant & AI Systems Architect, Islamabad, Pakistan


Section 01 · The Core Problem

Why evaluating agents is different from evaluating LLM calls

A single LLM call either answers the question well or it does not. An agent run makes 20 to 100 decisions in sequence. A failure at step 7 can produce a plausible-looking final output that is completely wrong.

Quick answer

Agent evaluation must happen at the span level — each tool call, retrieval decision, and reasoning step — not just at the final output. Output evaluation catches failures only after they have already propagated through the pipeline.

The standard for evaluating a chatbot — does the output answer the question, is it factually accurate, does it match the style guide — is insufficient for agents. An agent that retrieves the wrong document, calls the right tool with the wrong parameters, or misclassifies a user intent at step 3 will often produce a confident-looking final output. By the time you evaluate the output, the error has already propagated across the remaining steps.

Gartner predicts that over 40% of agentic AI projects will be cancelled by the end of 2027. A common pattern behind these failures: teams ship without evaluation infrastructure, get inconsistent results, cannot diagnose why, and lose confidence in the system. The fix is not a better model — it is better measurement at the step level.

Section 02 · Failure Categories

The three failure categories you need to measure

Retrieval failures

The agent retrieves the wrong documents, retrieves too few, or retrieves contextually irrelevant chunks. Downstream reasoning is then grounded in wrong information. RAGAS context precision and context recall measure this. Target context precision above 0.80 and context recall above 0.75.
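The targets above can be made concrete with simplified, set-based versions of these two metrics. Note this is a sketch: real RAGAS computes both with an LLM judge over the retrieved text (and rank-weights context precision), while this version assumes you already know which chunk IDs are relevant, as you would on a curated evaluation set.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant (set-based sketch)."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that were actually retrieved."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in set(retrieved)) / len(relevant)

# An agent retrieves 4 chunks, 3 of which are relevant,
# out of 5 relevant chunks total (chunk IDs are illustrative):
retrieved = ["doc1", "doc2", "doc3", "doc9"]
relevant = {"doc1", "doc2", "doc3", "doc4", "doc5"}
print(context_precision(retrieved, relevant))  # 0.75, below the 0.80 target
print(context_recall(retrieved, relevant))     # 0.6, below the 0.75 target
```

Both scores land below target here even though the final answer might still look plausible, which is exactly the failure mode output-only evaluation misses.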

Reasoning failures

The agent has the right context but draws the wrong conclusion, misclassifies an intent, or chooses the wrong tool for the task. These are harder to measure automatically and often require a separate judge model or a curated evaluation dataset with known-correct reasoning paths.

Action failures

The agent calls the right tool with wrong parameters, calls the wrong tool, or takes an action that is technically valid but contextually inappropriate. Span-level logging of every tool call with its parameters, return value, and the agent's subsequent reasoning step is the only way to catch these consistently.
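The shape of that logging can be sketched in plain Python. The `get_invoice_total` tool and the wrapper function are hypothetical, illustrating what gets recorded per call, not a LangSmith or LangGraph API:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ToolCallSpan:
    """One logged tool invocation: name, parameters, result, latency."""
    tool: str
    params: dict
    result: str = ""
    latency_ms: float = 0.0

spans: list[ToolCallSpan] = []

def logged_tool_call(tool_name: str, fn, **params) -> str:
    """Run a tool and record its parameters, return value, and latency."""
    start = time.perf_counter()
    result = fn(**params)
    spans.append(ToolCallSpan(
        tool=tool_name,
        params=params,
        result=result,
        latency_ms=(time.perf_counter() - start) * 1000,
    ))
    return result

# Hypothetical tool for illustration.
def get_invoice_total(invoice_id: str, currency: str) -> str:
    return f"total for {invoice_id} in {currency}"

logged_tool_call("get_invoice_total", get_invoice_total,
                 invoice_id="INV-42", currency="USD")
print(json.dumps([asdict(s) for s in spans], indent=2))
```

With parameters captured at call time, a wrong-parameter action failure is visible in the log itself rather than inferred from a wrong final answer.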

Section 03 · RAGAS Metrics

The five RAGAS metrics for production RAG-based agents

RAGAS production metrics — definitions and targets
Metric | What it measures | Target
Faithfulness | Claims in the answer are supported by retrieved context | Above 0.90
Answer relevancy | Answer addresses what the question asked | Above 0.85
Context precision | Retrieved chunks are relevant to the question | Above 0.80
Context recall | All information needed to answer was retrieved | Above 0.75
Answer correctness | Answer is factually correct vs ground truth | Above 0.80

RAGAS runs without ground truth labels for faithfulness, answer relevancy, and context precision. This makes it practical to run on live production traffic where you do not have human-verified correct answers for every query. Context recall and answer correctness require ground truth, so use them on a curated evaluation set during development, not live traffic.
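A sketch of the alerting side of this setup: the targets come from the table above, and the `sample_scores` dict stands in for whatever an async RAGAS evaluation run returns for a sampled query. The function names and dict shape are illustrative, not a RAGAS API.

```python
# Production targets for the three reference-free metrics (from the table above).
REFERENCE_FREE_TARGETS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
}

def threshold_breaches(scores: dict[str, float]) -> list[str]:
    """Return the metrics whose score fell below its production target."""
    return [metric for metric, target in REFERENCE_FREE_TARGETS.items()
            if metric in scores and scores[metric] < target]

# Scores as they might come back from one sampled production query.
sample_scores = {"faithfulness": 0.93,
                 "answer_relevancy": 0.81,
                 "context_precision": 0.86}
print(threshold_breaches(sample_scores))  # ['answer_relevancy']
```

In practice this check would feed a dashboard or pager rather than a print statement, but the logic is the same: compare each reference-free score against its target and surface the breaches.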

Section 04 · Span-Level Evaluation

Measuring at the step, not the output

Span-level evaluation logs every intermediate step of an agent run as a named span with its inputs, outputs, latency, and token cost. This is what LangSmith captures by default for LangGraph-based agents.

Each tool call is a span. Each retrieval is a span. Each reasoning step is a span. When an agent run produces a wrong result, you open the trace in LangSmith, find the span where the error originated, and read the exact inputs, outputs, and context that were present at that step. You do not guess — you see it.

This is the property that separates debuggable production systems from brittle ones. Without span-level observability, a wrong agent output is a mystery. With it, the wrong output is a single span you can identify, reproduce, and fix.
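The idea can be sketched as a stdlib context manager. LangSmith produces these records automatically for LangGraph agents; this hand-rolled version only illustrates what a span record contains, and the two agent steps shown are made up for the example:

```python
import time
import uuid
from contextlib import contextmanager

TRACE: list[dict] = []

@contextmanager
def span(name: str, inputs: dict):
    """Record one named step of an agent run with inputs, output, and latency."""
    record = {"id": str(uuid.uuid4()), "name": name,
              "inputs": inputs, "output": None, "latency_ms": None}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        TRACE.append(record)

# One agent run: a retrieval span followed by a reasoning span.
with span("retrieve", {"query": "refund policy"}) as s:
    s["output"] = ["chunk-12", "chunk-40"]

with span("reason", {"context": ["chunk-12", "chunk-40"]}) as s:
    s["output"] = "Refunds are allowed within 30 days."

print([s["name"] for s in TRACE])  # ['retrieve', 'reason']
```

When the final answer is wrong, walking `TRACE` shows whether the retrieval span returned the wrong chunks or the reasoning span misused correct ones, which is the diagnosis output-only evaluation cannot make.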

Span-level evaluation flow: each agent step (retrieval, reasoning, tool call) is logged as a named span. RAGAS and judge models evaluate spans asynchronously. Dashboards surface threshold breaches.
Span-level evaluation catches failures at the step where they originate. Output evaluation only sees the final result — after the failure has already propagated.

Section 05 · The Evaluation Stack

LangSmith plus RAGAS plus DeepEval: the 2026 production stack

LangSmith for observability

Captures every span for LangGraph-based agents automatically. Stores traces. Supports RAGAS integration. Lets you run evaluators on live traffic samples and historical traces. The minimum viable setup for any production agent.

RAGAS for retrieval quality

Reference-free metrics for faithfulness, answer relevancy, and context precision on live traffic. Run asynchronously on a 5 to 10% sample of production queries. Alert on metric drops below threshold.
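One way to implement the sampling gate, sketched here with a hash-based rule rather than `random.random()` so the decision is deterministic per query: a given query ID is either always in the evaluation sample or never, which makes metric drops reproducible. This is an illustrative design choice, not something RAGAS or LangSmith prescribes.

```python
import hashlib

def should_evaluate(query_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically sample ~10% of production queries for async evaluation.

    Hashing the query ID maps it to a stable point in [0, 1); IDs that land
    below the sample rate are always evaluated, the rest never are.
    """
    digest = hashlib.sha256(query_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Over many queries, roughly 10% fall into the evaluation sample.
sampled = sum(should_evaluate(f"query-{i}") for i in range(10_000))
print(sampled)  # roughly 1,000 of 10,000 queries
```

Queries passing the gate would be queued for the asynchronous RAGAS run; the response pipeline never blocks on it.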

DeepEval for behavioral testing

Test suite framework for evaluating agent behavior against curated datasets. Run in CI/CD on every deployment to catch regressions before they reach production. Covers hallucination detection, prompt injection resilience, and custom behavioral metrics.

Section 06 · Production Checklist

The minimum evaluation setup before you ship

Production evaluation checklist for LLM agents
Requirement | Tool | Frequency
Span-level tracing for all agent runs | LangSmith | Always on
Faithfulness above 0.90 | RAGAS via LangSmith | Async, 10% sample
Answer relevancy above 0.85 | RAGAS via LangSmith | Async, 10% sample
Behavioral regression tests | DeepEval in CI/CD | Every deployment
Tool-call schema validation | Custom validator in pipeline | Every tool call
Human review queue for low-confidence runs | LangSmith dataset | Weekly
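The tool-call schema validation row above can be sketched as a pre-execution check. The schema format here (a dict of required parameter names to types) is a made-up convention for illustration; in practice you might reuse the JSON Schema declarations your tools already expose to the model.

```python
def validate_tool_call(tool: str, params: dict, schemas: dict) -> list[str]:
    """Check a tool call's parameters against its declared schema before
    execution. Returns a list of violations; an empty list means valid."""
    schema = schemas.get(tool)
    if schema is None:
        return [f"unknown tool: {tool}"]
    errors = []
    for name, expected_type in schema["required"].items():
        if name not in params:
            errors.append(f"missing required parameter: {name}")
        elif not isinstance(params[name], expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}, "
                          f"got {type(params[name]).__name__}")
    for name in params:
        if name not in schema["required"] and name not in schema.get("optional", {}):
            errors.append(f"unexpected parameter: {name}")
    return errors

# Hypothetical schema for a single tool.
SCHEMAS = {"get_invoice_total": {"required": {"invoice_id": str, "currency": str}}}

print(validate_tool_call("get_invoice_total", {"invoice_id": "INV-42"}, SCHEMAS))
# ['missing required parameter: currency']
```

Rejecting a malformed call before execution turns a silent action failure into a logged, retryable error, which is why the checklist runs this on every tool call rather than on a sample.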

FAQ

Frequently asked questions

How do you evaluate AI agents in production?

Run span-level tracing to capture every intermediate step, tool call, and retrieval decision. Use RAGAS metrics asynchronously on a sample of live traffic to monitor faithfulness and answer relevancy. Run behavioral regression tests with DeepEval on every deployment. Avoid blocking the response pipeline on evaluation — run it asynchronously.

What is span-level evaluation for LLM agents?

Span-level evaluation logs each intermediate step of an agent run — each tool call, retrieval step, and reasoning step — as a named span with its inputs, outputs, and context. Evaluating at the span level lets you identify exactly which step produced an error rather than reverse-engineering it from the final output.

What RAGAS metrics should I use for a production RAG agent?

Start with faithfulness and answer relevancy — both are reference-free and can run on live traffic without ground truth labels. Target faithfulness above 0.90 and answer relevancy above 0.85. Add context precision and context recall using a curated evaluation dataset to measure retrieval quality specifically.

Is LangSmith the best evaluation tool for LangGraph agents?

LangSmith is the most integrated option for LangGraph-based agents — it captures spans automatically without instrumentation code, supports RAGAS integration natively, and provides a dataset interface for running evaluations on historical traces. For teams on other frameworks, Arize Phoenix and Langfuse are strong alternatives with similar capabilities.

Written by Mudassir Khan

Agentic AI consultant and AI systems architect based in Islamabad, Pakistan. CEO of Cube A Cloud. 38+ agentic AI launches delivered for global founders and CTOs.


Need an AI systems architect?

Book a 30-minute architecture call. I will sketch the high-level design for your use case and give you an honest view of the trade-offs.

Book a strategy call →