Section 01 · The Core Problem
Why evaluating agents is different from evaluating LLM calls
A single LLM call either answers the question well or it does not. An agent run makes 20 to 100 decisions in sequence. A failure at step 7 can produce a plausible-looking final output that is completely wrong.
Quick answer
Agent evaluation must happen at the span level — each tool call, retrieval decision, and reasoning step — not just at the final output. Output-only evaluation catches failures after they have already propagated through the pipeline.
The standard questions for evaluating a chatbot — does the output answer the question, is it factually accurate, does it match the style guide — are insufficient for agents. An agent that retrieves the wrong document, calls the right tool with the wrong parameters, or misclassifies a user intent at step 3 will often still produce a confident-looking final output. By the time you evaluate that output, the error has already propagated across the remaining steps.
Nearly half of agentic AI projects are predicted to be cancelled in 2026 for lack of proper evaluation infrastructure. Teams ship, get inconsistent results, cannot diagnose why, and lose confidence in the system. The fix is not a better model — it is better measurement at the step level.
Section 02 · Failure Categories
The three failure categories you need to measure
Retrieval failures
The agent retrieves the wrong documents, retrieves too few, or retrieves contextually irrelevant chunks. Downstream reasoning is then grounded in wrong information. RAGAS context precision and context recall measure this. Target context precision above 0.80 and context recall above 0.75.
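A minimal sketch of scoring these two metrics on one curated example with the RAGAS Python package; the column names and metric imports follow the ragas 0.1-style API and may differ in other releases.

```python
# Minimal sketch: scoring retrieval quality on one curated example with RAGAS.
# Column names and metric imports follow the ragas 0.1-style API.
# RAGAS calls an LLM judge under the hood, so OPENAI_API_KEY (or another
# configured judge model) must be available.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

eval_set = Dataset.from_dict({
    "question": ["What is the refund window for annual plans?"],
    "contexts": [["Annual plans can be refunded within 30 days of purchase."]],
    "answer": ["Annual plans are refundable within 30 days."],
    "ground_truth": ["Refunds on annual plans are available for 30 days after purchase."],
})

result = evaluate(eval_set, metrics=[context_precision, context_recall])
print(result)  # e.g. {'context_precision': 0.95, 'context_recall': 1.0}
```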
Reasoning failures
The agent has the right context but draws the wrong conclusion, misclassifies an intent, or chooses the wrong tool for the task. These are harder to measure automatically and often require a separate judge model or a curated evaluation dataset with known-correct reasoning paths.
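Where automatic metrics fall short, a judge model can grade a single reasoning step against the known-correct path. A minimal sketch assuming the OpenAI Python client; the prompt, model name, and PASS/FAIL scale are illustrative.

```python
# Sketch: grading one reasoning step with a judge model.
# The prompt, model name, and PASS/FAIL scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading one step of an agent's reasoning.

Context available to the agent:
{context}

Agent's conclusion at this step:
{conclusion}

Known-correct reasoning for this step:
{reference}

Reply PASS if the conclusion is consistent with the reference, otherwise FAIL."""

def judge_reasoning_step(context: str, conclusion: str, reference: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, conclusion=conclusion, reference=reference)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```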
Action failures
The agent calls the right tool with wrong parameters, calls the wrong tool, or takes an action that is technically valid but contextually inappropriate. Span-level logging of every tool call with its parameters, return value, and the agent's subsequent reasoning step is the only way to catch these consistently.
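A minimal sketch of that logging, framework-agnostic: every tool is wrapped so its name, parameters, return value, and errors land in a structured span record. The span fields and the print() sink are illustrative; in production the record would go to your tracing backend.

```python
# Sketch: wrap every tool so its name, parameters, return value, and errors
# are recorded as a structured span. Field names and print() sink are illustrative.
import functools
import json
import time
import uuid

def logged_tool(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = {
            "span_id": str(uuid.uuid4()),
            "tool": fn.__name__,
            "params": {"args": args, "kwargs": kwargs},
        }
        start = time.time()
        try:
            result = fn(*args, **kwargs)
            span["output"] = result
            return result
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            span["duration_s"] = round(time.time() - start, 3)
            print(json.dumps(span, default=str))  # replace with your trace sink
    return wrapper

@logged_tool
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}
```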
Section 03 · RAGAS Metrics
The five RAGAS metrics for production RAG-based agents
| Metric | What it measures | Target |
|---|---|---|
| Faithfulness | Claims in the answer are supported by retrieved context | Above 0.90 |
| Answer relevancy | Answer addresses what the question asked | Above 0.85 |
| Context precision | Retrieved chunks are relevant to the question | Above 0.80 |
| Context recall | All information needed to answer was retrieved | Above 0.75 |
| Answer correctness | Answer is factually correct vs ground truth | Above 0.80 |
RAGAS runs without ground truth labels for faithfulness, answer relevancy, and context precision. This makes it practical to run on live production traffic where you do not have human-verified correct answers for every query. Context recall and answer correctness require ground truth, so use them on a curated evaluation set during development, not live traffic.
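A minimal sketch of the reference-free path on one sampled production query; faithfulness and answer relevancy are shown, and context precision follows the same pattern (newer RAGAS releases expose an explicitly reference-free variant). Metric names follow the ragas 0.1-style API.

```python
# Sketch: reference-free scoring of one sampled production query.
# No ground_truth column is needed for these metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

live_sample = Dataset.from_dict({
    "question": ["How do I rotate my API key?"],
    "contexts": [["API keys are rotated from Settings > Security > Rotate key."]],
    "answer": ["Go to Settings > Security and click Rotate key."],
})

scores = evaluate(live_sample, metrics=[faithfulness, answer_relevancy])
```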
Section 04 · Span-Level Evaluation
Measuring at the step, not the output
Span-level evaluation logs every intermediate step of an agent run as a named span with its inputs, outputs, latency, and token cost. LangSmith captures this automatically for LangGraph-based agents once tracing is enabled.
Each tool call is a span. Each retrieval is a span. Each reasoning step is a span. When an agent run produces a wrong result, you open the trace in LangSmith, find the span where the error originated, and read the exact inputs, outputs, and context that were present at that step. You do not guess — you see it.
This is the property that separates debuggable production systems from brittle ones. Without span-level observability, a wrong agent output is a mystery. With it, the wrong output is a single span you can identify, reproduce, and fix.
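Outside of LangGraph's automatic tracing, the same span structure can be produced by hand with LangSmith's traceable decorator. A sketch assuming LANGSMITH_API_KEY and the tracing environment variable are set; the function bodies are placeholders.

```python
# Sketch: hand-instrumented spans with LangSmith's traceable decorator.
# Assumes LANGSMITH_API_KEY and tracing are configured in the environment.
# LangGraph agents traced through LangSmith produce equivalent spans without this code.
from langsmith import traceable

@traceable(run_type="retriever", name="search_docs")
def search_docs(query: str) -> list[str]:
    return ["Annual plans can be refunded within 30 days of purchase."]

@traceable(run_type="tool", name="lookup_order")
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

@traceable(name="answer_refund_question")  # parent span; nested calls become child spans
def answer_refund_question(question: str) -> str:
    context = search_docs(question)
    return f"Based on policy: {context[0]}"
```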
Section 05 · The Evaluation Stack
LangSmith plus RAGAS plus DeepEval: the 2026 production stack
LangSmith for observability
Captures every span for LangGraph-based agents automatically. Stores traces. Supports RAGAS integration. Lets you run evaluators on live traffic samples and historical traces. The minimum viable setup for any production agent.
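A sketch of running a custom evaluator over a LangSmith dataset of curated examples; the dataset name, target function, and evaluator are illustrative, and the evaluate entry point differs slightly across langsmith SDK versions.

```python
# Sketch: running a custom evaluator over a LangSmith dataset.
# Dataset name, target, and evaluator logic are illustrative.
from langsmith import Client

client = Client()

def run_agent(inputs: dict) -> dict:
    # Call your real agent here; `inputs` comes from each dataset example.
    return {"answer": "Annual plans are refundable within 30 days."}

def exact_match(run, example) -> dict:
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": float(predicted == expected)}

client.evaluate(
    run_agent,
    data="agent-regression-set",   # a dataset created in LangSmith
    evaluators=[exact_match],
    experiment_prefix="nightly-eval",
)
```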
RAGAS for retrieval quality
Reference-free metrics for faithfulness, answer relevancy, and context precision on live traffic. Run asynchronously on a 5 to 10% sample of production queries. Alert on metric drops below threshold.
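A sketch of the sampling and alerting wiring around that async scoring; the work queue, background scorer, and alert hook are placeholders, and the thresholds mirror the targets above.

```python
# Sketch: sample a fraction of production runs for async scoring and alert on
# threshold breaches. Queue, scorer, and alert hook are placeholders.
import random

SAMPLE_RATE = 0.10                       # 10% of production runs
THRESHOLDS = {"faithfulness": 0.90, "answer_relevancy": 0.85}

def maybe_enqueue_for_eval(run_record: dict, eval_queue) -> None:
    """Called after the response is sent; never blocks the user-facing path."""
    if random.random() < SAMPLE_RATE:
        eval_queue.put(run_record)       # a background worker scores it with RAGAS

def check_thresholds(scores: dict[str, float], alert) -> None:
    for metric, target in THRESHOLDS.items():
        value = scores.get(metric)
        if value is not None and value < target:
            alert(f"{metric} dropped to {value:.2f} (target {target:.2f})")
```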
DeepEval for behavioral testing
Test suite framework for evaluating agent behavior against curated datasets. Run in CI/CD on every deployment to catch regressions before they reach production. Covers hallucination detection, prompt injection resilience, and custom behavioral metrics.
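A sketch of one such test, run in CI with pytest or DeepEval's test runner; the test case content is illustrative and assumes DeepEval's standard metric classes.

```python
# Sketch of a DeepEval regression test for CI/CD.
# Thresholds mirror the targets above; the test case content is illustrative.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_refund_policy_answer():
    test_case = LLMTestCase(
        input="What is the refund window for annual plans?",
        actual_output="Annual plans can be refunded within 30 days of purchase.",
        context=["Annual plans can be refunded within 30 days of purchase."],
    )
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.85),
        HallucinationMetric(threshold=0.5),  # hallucination score must stay at or below 0.5
    ])
```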
Section 06 · Production Checklist
The minimum evaluation setup before you ship
| Requirement | Tool | Frequency |
|---|---|---|
| Span-level tracing for all agent runs | LangSmith | Always on |
| Faithfulness above 0.90 | RAGAS via LangSmith | Async, 10% sample |
| Answer relevancy above 0.85 | RAGAS via LangSmith | Async, 10% sample |
| Behavioral regression tests | DeepEval in CI/CD | Every deployment |
| Tool-call schema validation (sketch below) | Custom validator in pipeline | Every tool call |
| Human review queue for low-confidence runs | LangSmith dataset | Weekly |
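A sketch of the tool-call schema validation row, using Pydantic models as the per-tool argument schemas; the tool name and fields are illustrative.

```python
# Sketch: tool-call schema validation with Pydantic before any tool executes.
# The tool name and argument model are illustrative.
from pydantic import BaseModel, Field, ValidationError

class RefundOrderArgs(BaseModel):
    order_id: str = Field(min_length=1)
    amount_cents: int = Field(gt=0)
    reason: str

TOOL_SCHEMAS: dict[str, type[BaseModel]] = {"refund_order": RefundOrderArgs}

def validate_tool_call(tool_name: str, raw_args: dict) -> BaseModel:
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        raise ValueError(f"Agent called unknown tool: {tool_name}")
    try:
        return schema.model_validate(raw_args)  # Pydantic v2
    except ValidationError as exc:
        # Record this on the span as an action failure rather than retrying silently.
        raise ValueError(f"Invalid arguments for {tool_name}: {exc}") from exc
```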
FAQ
Frequently asked questions
How do you evaluate AI agents in production?
Run span-level tracing to capture every intermediate step, tool call, and retrieval decision. Use RAGAS metrics asynchronously on a sample of live traffic to monitor faithfulness and answer relevancy. Run behavioral regression tests with DeepEval on every deployment. Avoid blocking the response pipeline on evaluation — run it asynchronously.
What is span-level evaluation for LLM agents?
Span-level evaluation logs each intermediate step of an agent run — each tool call, retrieval step, and reasoning step — as a named span with its inputs, outputs, and context. Evaluating at the span level lets you identify exactly which step produced an error rather than reverse-engineering it from the final output.
What RAGAS metrics should I use for a production RAG agent?
Start with faithfulness and answer relevancy — both are reference-free and can run on live traffic without ground truth labels. Target faithfulness above 0.90 and answer relevancy above 0.85. Add context precision and context recall using a curated evaluation dataset to measure retrieval quality specifically.
Is LangSmith the best evaluation tool for LangGraph agents?
LangSmith is the most integrated option for LangGraph-based agents — it captures spans automatically without instrumentation code, supports RAGAS integration natively, and provides a dataset interface for running evaluations on historical traces. For teams on other frameworks, Arize Phoenix and Langfuse are strong alternatives with comparable capabilities.