Section 01 · The Core Problem
Why evaluating agents is different from evaluating LLM calls
A single LLM call either answers the question well or it does not. An agent run makes 20 to 100 decisions in sequence. A failure at step 7 can produce a plausible-looking final output that is completely wrong.
Quick answer
The short answer: Agent evaluation must happen at the span level — each tool call, retrieval decision, and reasoning step — not just at the final output. Output evaluation catches failures after they have already propagated through the pipeline.
The standard for evaluating a chatbot — does the output answer the question, is it factually accurate, does it match the style guide — is insufficient for agents. An agent that retrieves the wrong document, calls the right tool with the wrong parameters, or misclassifies a user intent at step 3 will often produce a confident-looking final output. By the time you evaluate the output, the error has already propagated across the remaining steps.
Nearly half of agentic AI projects are predicted to be cancelled in 2026 for lack of proper evaluation infrastructure. Teams ship, get inconsistent results, cannot diagnose why, and lose confidence in the system. The fix is not a better model — it is better measurement at the step level.
Section 02 · Failure Categories
The three failure categories you need to measure
Retrieval failures
The agent retrieves the wrong documents, retrieves too few, or retrieves contextually irrelevant chunks. Downstream reasoning is then grounded in wrong information. RAGAS context precision and context recall measure this. Target context precision above 0.80 and context recall above 0.75.
Reasoning failures
The agent has the right context but draws the wrong conclusion, misclassifies an intent, or chooses the wrong tool for the task. These are harder to measure automatically and often require a separate judge model or a curated evaluation dataset with known-correct reasoning paths.
Action failures
The agent calls the right tool with wrong parameters, calls the wrong tool, or takes an action that is technically valid but contextually inappropriate. Span-level logging of every tool call with its parameters, return value, and the agent's subsequent reasoning step is the only way to catch these consistently.
Section 03 · RAGAS Metrics
The five RAGAS metrics for production RAG-based agents
| Metric | What it measures | Target |
|---|---|---|
| Faithfulness | Claims in the answer are supported by retrieved context | Above 0.90 |
| Answer relevancy | Answer addresses what the question asked | Above 0.85 |
| Context precision | Retrieved chunks are relevant to the question | Above 0.80 |
| Context recall | All information needed to answer was retrieved | Above 0.75 |
| Answer correctness | Answer is factually correct vs ground truth | Above 0.80 |
RAGAS runs without ground truth labels for faithfulness, answer relevancy, and context precision. This makes it practical to run on live production traffic where you do not have human-verified correct answers for every query. Context recall and answer correctness require ground truth, so use them on a curated evaluation set during development, not live traffic.
Section 04 · Span-Level Evaluation
Measuring at the step, not the output
Span-level evaluation logs every intermediate step of an agent run as a named span with its inputs, outputs, latency, and token cost. This is what LangSmith captures by default for LangGraph-based agents.
Each tool call is a span. Each retrieval is a span. Each reasoning step is a span. When an agent run produces a wrong result, you open the trace in LangSmith, find the span where the error originated, and read the exact inputs, outputs, and context that were present at that step. You do not guess — you see it.
This is the property that separates debuggable production systems from brittle ones. Without span-level observability, a wrong agent output is a mystery. With it, the wrong output is a single span you can identify, reproduce, and fix.
Section 05 · The Evaluation Stack
LangSmith plus RAGAS plus DeepEval: the 2026 production stack
LangSmith for observability
Captures every span for LangGraph-based agents automatically. Stores traces. Supports RAGAS integration. Lets you run evaluators on live traffic samples and historical traces. The minimum viable setup for any production agent.
RAGAS for retrieval quality
Reference-free metrics for faithfulness, answer relevancy, and context precision on live traffic. Run asynchronously on a 5 to 10% sample of production queries. Alert on metric drops below threshold.
DeepEval for behavioral testing
Test suite framework for evaluating agent behavior against curated datasets. Run in CI/CD on every deployment to catch regressions before they reach production. Covers hallucination detection, prompt injection resilience, and custom behavioral metrics.
Section 06 · Production Checklist
The minimum evaluation setup before you ship
| Requirement | Tool | Frequency |
|---|---|---|
| Span-level tracing for all agent runs | LangSmith | Always on |
| Faithfulness above 0.90 | RAGAS via LangSmith | Async, 10% sample |
| Answer relevancy above 0.85 | RAGAS via LangSmith | Async, 10% sample |
| Behavioral regression tests | DeepEval in CI/CD | Every deployment |
| Tool-call schema validation | Custom validator in pipeline | Every tool call |
| Human review queue for low-confidence runs | LangSmith dataset | Weekly |
Section 07 · Research Foundations
How to evaluate LLM agents: what the research says
The academic literature on LLM agent evaluation has grown rapidly. The key papers that define current production practice — and that Google surfaces for the query 'how to evaluate LLM agents' — are worth understanding.
Quick answer
How to evaluate LLM agents: Evaluate at three levels: span-level (each tool call and retrieval step), task-level (did the agent complete the goal), and trajectory-level (was the sequence of decisions optimal). Academic research in 2026 has established multi-agent debate, analytical evaluation boards, and longitudinal memory tests as the leading methodologies.
ChatEval: multi-agent debate as an evaluation mechanism
ChatEval (published 2024) proposes that LLM-based evaluators are more accurate when structured as a multi-agent debate — multiple evaluator models argue for and against an answer before reaching a consensus verdict. The paper demonstrates that debate-based evaluation outperforms single-judge evaluation on factual accuracy and reasoning quality metrics. For production teams, the practical implication is that using a single LLM judge produces biased evaluations; a panel of two or three models debating the same output produces more reliable scores.
AgentBench and AgentBoard: multitask evaluation across domains
AgentBench evaluates LLM agents across eight distinct environments — operating system interactions, database queries, knowledge graph tasks, web navigation, and others. AgentBoard extends this to an analytical evaluation board that breaks performance down by task type and difficulty level, showing that current frontier models (GPT-5.4, Claude Sonnet 4.6) excel at sequential reasoning but still fail on tasks requiring precise multi-step planning across five or more dependent actions. For production evaluation, this informs which task types need more human review.
Evaluating very long-term conversational memory of LLM agents
A 2025 paper from researchers at Stanford and DeepMind specifically tests agent memory over hundreds of conversation turns. The finding most relevant to production: agents that appear to maintain memory in testing consistently fail when the conversation history exceeds the effective attention window, even when context technically fits in the model's nominal context length. This has a direct production implication — evaluating agent memory over short sessions is not predictive of production performance in long running tasks. Design evaluations that run across the full expected task duration.
Section 08 · FAQ
Frequently asked questions
How do you evaluate AI agents in production?
Run span-level tracing to capture every intermediate step, tool call, and retrieval decision. Use RAGAS metrics asynchronously on a sample of live traffic to monitor faithfulness and answer relevancy. Run behavioral regression tests with DeepEval on every deployment. Avoid blocking the response pipeline on evaluation — run it asynchronously.
What is span-level evaluation for LLM agents?
Span-level evaluation logs each intermediate step of an agent run — each tool call, retrieval step, and reasoning step — as a named span with its inputs, outputs, and context. Evaluating at the span level lets you identify exactly which step produced an error rather than reverse-engineering it from the final output.
What RAGAS metrics should I use for a production RAG agent?
Start with faithfulness and answer relevancy — both are reference-free and can run on live traffic without ground truth labels. Target faithfulness above 0.90 and answer relevancy above 0.85. Add context precision and context recall using a curated evaluation dataset to measure retrieval quality specifically.
Is LangSmith the best evaluation tool for LangGraph agents?
LangSmith is the most integrated option for LangGraph-based agents — it captures spans automatically without instrumentation code, supports RAGAS integration natively, and provides a dataset interface for running evaluations on historical traces. For teams on other frameworks, Arize Phoenix and Langfuse are strong alternatives with similar capability.
What is LLM agent evaluation?
LLM agent evaluation is the systematic measurement of an AI agent's performance across the full run — not just the final output. It includes span-level evaluation (measuring each tool call, retrieval step, and reasoning decision), task-level evaluation (did the agent complete the objective), and trajectory evaluation (was the sequence of actions optimal or did the agent take unnecessary or harmful detours). For RAG-based agents, RAGAS provides standardized metrics. For general agents, LangSmith, DeepEval, and AgentBench are the primary evaluation frameworks in 2026.
How do you evaluate LLM agent memory over long conversations?
Run evaluation sessions that match or exceed the expected production task duration. Academic research shows that agent memory reliably degrades when conversation history approaches the model's effective attention window — even when context nominally fits in the token limit. Measure performance at 10, 50, 100, and 200 turns to find the degradation point. For production agents with long running tasks, implement explicit memory summarization and context compression to maintain performance beyond the degradation threshold.