How to Evaluate LLM Agents in Production: Beyond Unit Tests

Key takeaways

Agent failures happen at the span level — a wrong tool call, a hallucinated retrieval decision, a missed condition in a reasoning step — not at the final output. Unit tests and output evaluation catch these too late.
RAGAS provides five metrics for RAG-based agents: faithfulness, answer relevancy, context precision, context recall, and answer correctness. The first three are reference-free, so you can run them on live traffic; context recall and answer correctness need ground truth.
Span-level evaluation means measuring at each tool call, each retrieval step, and each reasoning step — not just at the final answer. This is what separates observable production systems from brittle ones.
LangSmith is the minimum viable observability setup for any LangGraph-based production agent. It captures every span, supports RAGAS integration, and lets you run evaluations on live traffic samples.
Run evaluations asynchronously on a sample of production traffic. Blocking the response pipeline on evaluation adds latency and provides no user value.

Section 01 · The Core Problem

Why evaluating agents is different from evaluating LLM calls

A single LLM call either answers the question well or it does not. An agent run can make 20 to 100 decisions in sequence. A failure at step 7 can produce a plausible-looking final output that is completely wrong.

Quick answer

Agent evaluation must happen at the span level (each tool call, retrieval decision, and reasoning step), not just at the final output. Output evaluation catches failures only after they have propagated through the pipeline.

The standard for evaluating a chatbot — does the output answer the question, is it factually accurate, does it match the style guide — is insufficient for agents. An agent that retrieves the wrong document, calls the right tool with the wrong parameters, or misclassifies a user intent at step 3 will often produce a confident-looking final output. By the time you evaluate the output, the error has already propagated across the remaining steps.

Gartner has predicted that more than 40% of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Missing evaluation infrastructure makes each of these worse: teams ship, get inconsistent results, cannot diagnose why, and lose confidence in the system. The fix is not a better model; it is better measurement at the step level.

Section 02 · Failure Categories

The three failure categories you need to measure

Retrieval failures

The agent retrieves the wrong documents, retrieves too few, or retrieves contextually irrelevant chunks. Downstream reasoning is then grounded in wrong information. RAGAS context precision and context recall measure this. Target context precision above 0.80 and context recall above 0.75.

Reasoning failures

The agent has the right context but draws the wrong conclusion, misclassifies an intent, or chooses the wrong tool for the task. These are harder to measure automatically and often require a separate judge model or a curated evaluation dataset with known-correct reasoning paths.
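One way to use a curated dataset of known-correct reasoning paths is to compare the sequence of steps the agent actually took against the expected sequence and report where they first diverge. The sketch below is a minimal illustration; the function name and the step names are hypothetical, and it assumes a path can be represented as an ordered list of step names.

```python
from typing import List, Optional


def first_divergence(expected: List[str], actual: List[str]) -> Optional[int]:
    """Return the 0-based index of the first step where the paths differ.

    Returns None when the two paths are identical.
    """
    for index, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return index
    # One path is a prefix of the other: they diverge where the shorter ends.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# Hypothetical curated example: the agent picked the wrong tool at step index 1.
expected_path = ["classify_intent", "search_orders", "issue_refund"]
actual_path = ["classify_intent", "search_docs", "issue_refund"]
divergence = first_divergence(expected_path, actual_path)
```

An exact-match comparison like this is strict. Agents that can legitimately reach the same result by different routes may need a looser check or a judge model instead.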

Action failures

The agent calls the right tool with wrong parameters, calls the wrong tool, or takes an action that is technically valid but contextually inappropriate. Span-level logging of every tool call with its parameters, return value, and the agent's subsequent reasoning step is the only way to catch these consistently.
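As a rough illustration of what span-level logging of a tool call involves, the sketch below wraps a tool function so that each call records its parameters, return value, any error, and latency. The `ToolSpan` schema, the `traced_tool` decorator, the in-memory `SPANS` list, and the `convert_currency` tool are all hypothetical names for this example. They are not part of LangSmith or any other library, which would normally handle this for you.

```python
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToolSpan:
    """One recorded tool call (hypothetical schema for illustration)."""
    name: str
    params: Dict[str, Any] = field(default_factory=dict)
    output: Any = None
    error: Optional[str] = None
    latency_ms: float = 0.0


SPANS: List[ToolSpan] = []  # in-memory sink; a real system would export spans


def traced_tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record parameters, return value, error, and latency for each call."""
    @functools.wraps(func)
    def wrapper(**params: Any) -> Any:
        span = ToolSpan(name=func.__name__, params=dict(params))
        start = time.perf_counter()
        try:
            span.output = func(**params)
            return span.output
        except Exception as exc:
            span.error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            # Runs on success and on failure, so every call leaves a span.
            span.latency_ms = (time.perf_counter() - start) * 1000
            SPANS.append(span)
    return wrapper


@traced_tool
def convert_currency(amount: float, rate: float) -> float:
    """Hypothetical tool used only to exercise the decorator."""
    return round(amount * rate, 2)


convert_currency(amount=100.0, rate=0.92)
```

The wrapper accepts keyword arguments only, so every recorded parameter has a name. This makes wrong-parameter failures easier to spot when reading a trace.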

Section 03 · RAGAS Metrics

The five RAGAS metrics for production RAG-based agents

RAGAS production metrics — definitions and targets
Metric	What it measures	Target
Faithfulness	Claims in the answer are supported by retrieved context	Above 0.90
Answer relevancy	Answer addresses what the question asked	Above 0.85
Context precision	Retrieved chunks are relevant to the question	Above 0.80
Context recall	All information needed to answer was retrieved	Above 0.75
Answer correctness	Answer is factually correct vs ground truth	Above 0.80

RAGAS runs without ground truth labels for faithfulness, answer relevancy, and context precision. This makes it practical to run on live production traffic where you do not have human-verified correct answers for every query. Context recall and answer correctness require ground truth, so use them on a curated evaluation set during development, not live traffic.
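The targets in the table, and the split between reference-free and ground-truth metrics, can be encoded as a small threshold check. This is a sketch only: `TARGETS` and `find_breaches` are hypothetical names, and the scores are assumed to arrive as a plain dictionary from whatever evaluator you run.

```python
from typing import Dict, List, Tuple

# Targets copied from the table above: metric -> (target, needs ground truth).
TARGETS: Dict[str, Tuple[float, bool]] = {
    "faithfulness": (0.90, False),
    "answer_relevancy": (0.85, False),
    "context_precision": (0.80, False),
    "context_recall": (0.75, True),
    "answer_correctness": (0.80, True),
}


def find_breaches(scores: Dict[str, float], has_ground_truth: bool) -> List[str]:
    """Return the metrics whose score is not above its target.

    Metrics that need ground truth are skipped when none is available,
    which is the usual situation on live traffic.
    """
    breaches = []
    for metric, (target, needs_reference) in TARGETS.items():
        if needs_reference and not has_ground_truth:
            continue
        if metric in scores and scores[metric] <= target:
            breaches.append(metric)
    return breaches


# Illustrative scores, not real measurements.
live_scores = {
    "faithfulness": 0.93,
    "answer_relevancy": 0.81,
    "context_precision": 0.84,
    "context_recall": 0.40,
}
live_breaches = find_breaches(live_scores, has_ground_truth=False)
```

With `has_ground_truth=False`, the context recall score is ignored even though it is present, since a value computed without a reference answer would not be meaningful.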

Section 04 · Span-Level Evaluation

Measuring at the step, not the output

Span-level evaluation logs every intermediate step of an agent run as a named span with its inputs, outputs, latency, and token cost. This is what LangSmith captures for LangGraph-based agents once tracing is enabled.

Each tool call is a span. Each retrieval is a span. Each reasoning step is a span. When an agent run produces a wrong result, you open the trace in LangSmith, find the span where the error originated, and read the exact inputs, outputs, and context that were present at that step. You do not guess — you see it.

This is the property that separates debuggable production systems from brittle ones. Without span-level observability, a wrong agent output is a mystery. With it, the wrong output is a single span you can identify, reproduce, and fix.
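Finding the span where an error originated amounts to walking the trace in step order and stopping at the first span that fails its check. The sketch below assumes a simplified, hypothetical `Span` record with a precomputed `passed` flag. Real traces carry more fields, and the check itself might be a RAGAS score, a judge model verdict, or a schema validation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Simplified span record (hypothetical, for illustration only)."""
    step: int
    kind: str     # "retrieval", "reasoning", or "tool_call"
    name: str
    passed: bool  # outcome of whatever check was applied to this span


def first_failing_span(trace: List[Span]) -> Optional[Span]:
    """Return the earliest span that failed its check, or None if all passed."""
    ordered = sorted(trace, key=lambda span: span.step)
    return next((span for span in ordered if not span.passed), None)


# Illustrative trace: the retrieval at step 2 is where the run went wrong;
# the failed tool call at step 3 is a downstream consequence.
trace = [
    Span(1, "reasoning", "classify_intent", True),
    Span(2, "retrieval", "search_docs", False),
    Span(3, "tool_call", "create_ticket", False),
    Span(4, "reasoning", "draft_answer", True),
]
origin = first_failing_span(trace)
```

Note that the final reasoning span passes its own check here, which is the situation the article describes: the output looks fine even though an earlier step failed.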

Figure: Span-level evaluation flow. Each agent step (retrieval, reasoning, tool call) is logged as a named span. RAGAS and judge models evaluate spans asynchronously, and dashboards surface threshold breaches. Span-level evaluation catches failures at the step where they originate; output evaluation sees only the final result, after the failure has already propagated.

Section 05 · The Evaluation Stack

LangSmith plus RAGAS plus DeepEval: the 2026 production stack

LangSmith for observability

Captures every span for LangGraph-based agents automatically. Stores traces. Supports RAGAS integration. Lets you run evaluators on live traffic samples and historical traces. The minimum viable setup for any production agent.

RAGAS for retrieval quality

Reference-free metrics for faithfulness, answer relevancy, and context precision on live traffic. Run asynchronously on a 5 to 10% sample of production queries. Alert on metric drops below threshold.
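Sampled, non-blocking evaluation can be sketched with a deterministic hash-based sampler and a background task. Everything here is an assumption made for illustration: the 10% rate, the `in_sample`, `evaluate_trace`, and `handle_request` names, and the sleep that stands in for a real RAGAS or judge-model call.

```python
import asyncio
import hashlib
from typing import List, Set

SAMPLE_RATE = 0.10         # assumed rate; the article suggests 5 to 10%
EVALUATED: List[str] = []  # stands in for wherever scores would be stored


def in_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: hash the trace ID into [0, 1) and compare."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


async def evaluate_trace(trace_id: str) -> None:
    """Placeholder for a slow evaluation call (for example, a judge model)."""
    await asyncio.sleep(0.01)
    EVALUATED.append(trace_id)


async def handle_request(trace_id: str, background: Set[asyncio.Task]) -> str:
    answer = f"answer for {trace_id}"
    if in_sample(trace_id):
        task = asyncio.create_task(evaluate_trace(trace_id))
        background.add(task)  # keep a reference so the task is not dropped
        task.add_done_callback(background.discard)
    return answer  # returned without waiting for the evaluation


async def main(request_count: int) -> int:
    background: Set[asyncio.Task] = set()
    for i in range(request_count):
        await handle_request(f"trace-{i}", background)
    await asyncio.gather(*background)  # drain pending evaluations before exit
    return len(EVALUATED)


sampled_count = asyncio.run(main(1000))
```

Hashing the trace ID rather than calling a random number generator means the same trace is always in or out of the sample, which keeps re-runs and backfills consistent.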

DeepEval for behavioral testing

Test suite framework for evaluating agent behavior against curated datasets. Run in CI/CD on every deployment to catch regressions before they reach production. Covers hallucination detection, prompt injection resilience, and custom behavioral metrics.

Section 06 · Production Checklist

The minimum evaluation setup before you ship

Production evaluation checklist for LLM agents
Requirement	Tool	Frequency
Span-level tracing for all agent runs	LangSmith	Always on
Faithfulness above 0.90	RAGAS via LangSmith	Async, 10% sample
Answer relevancy above 0.85	RAGAS via LangSmith	Async, 10% sample
Behavioral regression tests	DeepEval in CI/CD	Every deployment
Tool-call schema validation	Custom validator in pipeline	Every tool call
Human review queue for low-confidence runs	LangSmith dataset	Weekly
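The "custom validator in pipeline" row can be as small as a lookup of expected parameter names and types, checked before the tool runs. The sketch below is one possible shape, not a prescribed design: `TOOL_SCHEMAS`, `validate_tool_call`, and the `refund_order` tool are hypothetical. A production system might instead use JSON Schema or a validation library.

```python
from typing import Any, Dict, List

# Hypothetical schema registry: tool name -> {parameter name: expected type}.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "refund_order": {"order_id": str, "amount_cents": int},
}


def validate_tool_call(tool: str, params: Dict[str, Any]) -> List[str]:
    """Return a list of problems with the call; an empty list means valid."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return [f"unknown tool: {tool}"]
    errors = []
    for name, expected in schema.items():
        if name not in params:
            errors.append(f"missing parameter: {name}")
        elif type(params[name]) is not expected:
            # Exact type match, so True is not accepted where an int is expected.
            errors.append(
                f"wrong type for {name}: expected {expected.__name__}, "
                f"got {type(params[name]).__name__}"
            )
    for name in params:
        if name not in schema:
            errors.append(f"unexpected parameter: {name}")
    return errors


ok = validate_tool_call("refund_order", {"order_id": "A-17", "amount_cents": 1999})
bad = validate_tool_call("refund_order", {"order_id": "A-17", "amount_cents": "1999"})
```

A check like this catches malformed calls only. A call that is well-formed but contextually wrong, such as refunding the wrong order, still needs span-level review.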

FAQ

Frequently asked questions

How do you evaluate AI agents in production?

Run span-level tracing to capture every intermediate step, tool call, and retrieval decision. Run RAGAS metrics asynchronously on a sample of live traffic to monitor faithfulness and answer relevancy, so that evaluation never blocks the response pipeline. Run behavioral regression tests with DeepEval on every deployment.

What is span-level evaluation for LLM agents?

Span-level evaluation logs each intermediate step of an agent run — each tool call, retrieval step, and reasoning step — as a named span with its inputs, outputs, and context. Evaluating at the span level lets you identify exactly which step produced an error rather than reverse-engineering it from the final output.

What RAGAS metrics should I use for a production RAG agent?

Start with faithfulness and answer relevancy: both are reference-free and can run on live traffic without ground truth labels. Target faithfulness above 0.90 and answer relevancy above 0.85. To measure retrieval quality specifically, add context precision (above 0.80), which is also reference-free, and context recall (above 0.75), which needs a curated evaluation dataset with ground truth.

Is LangSmith the best evaluation tool for LangGraph agents?

LangSmith is the most integrated option for LangGraph-based agents: once tracing is enabled, it captures spans without additional instrumentation code, works with RAGAS, and provides a dataset interface for running evaluations on historical traces. For teams on other frameworks, Arize Phoenix and Langfuse are strong alternatives with similar capabilities.