Agentic AI Testing: How to QA Non-Deterministic Systems

Key takeaways

Agentic systems are non-deterministic by design: the same input can produce different valid outputs across runs. Traditional assertion-based unit tests cannot catch the failure modes that matter — hallucinated tool arguments, reasoning drift, and cascading retries.
The four-layer testing model covers what production agentic systems actually need: unit tests for deterministic components, trace-based integration tests for step sequences, LLM as judge evaluation for output quality, and chaos testing for failure paths.
Unit tests belong on tool wrappers and context management logic, the parts of the system that behave deterministically. Running these with standard pytest in CI catches regressions in that code before any LLM call is made.
LLM as judge evaluation uses a separate model to score output quality against a golden dataset of 50 to 100 annotated cases. Using a different model family as the judge reduces self-reinforcing bias.
Chaos testing injects six specific failure modes — tool timeouts, malformed results, model refusals, context overflow, retry storms, and schema drift — to verify the agent handles each gracefully rather than stalling or looping.

Quick answer

How do you test an agentic AI system? Use a four-layer model: unit tests for deterministic components like tool wrappers and context managers, trace-based integration tests that assert tool call order and budget, LLM as judge evaluation against a golden dataset for output quality, and chaos testing to verify failure path handling. Each layer catches a different class of bug. Relying on only one layer leaves significant gaps.

Section 01 · Framing

Why traditional software testing breaks for agents

Traditional tests assume determinism: the same input always produces the same output. Agents are non-deterministic by design. The failure modes that matter are invisible to assertion-based tests.

Standard software testing rests on a single assumption: given the same input, the system produces the same output. Write a test, specify the expected value, assert that the output matches. This model works cleanly for functions with deterministic behavior — parsing, validation, data transformation, API clients — because the behavior is fully specified by the code. The same assumption breaks immediately when applied to systems that use language models for decision-making.

Agentic systems produce non-deterministic outputs by design. The language model at the core of the agent draws samples from a probability distribution, which means the same input can produce different tool call sequences, different reasoning steps, and different final outputs across runs. More importantly, different outputs can all be correct: there are often multiple valid paths through a multi-step task, and the one the model takes on a given run is not predictable. A test that asserts an exact output string will fail on a valid response, and a test that only checks whether the final output looks approximately right will miss invalid intermediate steps.

The failure modes that matter in production agentic systems are not the ones traditional tests catch. Hallucinated tool arguments — syntactically valid but semantically wrong parameter values — fail silently if the tool returns empty results rather than an error. Reasoning drift — the agent gradually losing track of the original objective across a long chain — is invisible unless you are examining the intermediate steps. Cascading retries — an uncapped retry loop triggered by a persistent tool failure — do not produce a test failure; they produce a runaway API bill and a timed-out user session. None of these show up in output-only assertion tests. They require a testing model that examines the agent's behavior at the level of individual steps and failure conditions.

Section 02 · Layer 01

Layer 1: Unit tests for deterministic components

Not all of an agentic system is non-deterministic. Tool wrappers, schema validators, context management logic, and retry handlers are fully deterministic and should be tested with standard pytest.

Every agentic system contains a substantial layer of deterministic code that wraps the non-deterministic LLM. Tool wrappers — the functions that translate LLM-generated argument objects into actual API calls and translate the results back into a format the model can reason from — are fully deterministic. Schema validators that check LLM-generated arguments against a JSON Schema before execution are deterministic. Context window management logic that decides when to summarize, what to retain, and what to drop from the conversation history is deterministic. Retry handlers that implement backoff, cap retry counts, and format error feedback for the model are deterministic. All of these belong in standard unit tests run in CI with no LLM API calls.
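As a minimal sketch of what such a unit test can look like, the example below validates LLM-generated arguments before a tool wrapper touches its backend. The `search` tool, its argument spec, the `search_wrapper` function, and the hand-rolled validator are all hypothetical stand-ins for illustration. A real system would more likely validate against a JSON Schema.

```python
from __future__ import annotations

from typing import Any, Callable

# Hypothetical argument spec for a "search" tool: name -> (type, required).
SEARCH_ARGS = {"query": (str, True), "max_results": (int, False)}


def validate_args(args: dict[str, Any], spec: dict[str, tuple[type, bool]]) -> list[str]:
    """Return a list of problems; an empty list means the arguments are valid."""
    problems = []
    for name, (expected_type, required) in spec.items():
        if name not in args:
            if required:
                problems.append(f"missing required argument: {name}")
        elif type(args[name]) is not expected_type:
            problems.append(f"{name} must be {expected_type.__name__}")
    for name in args:
        if name not in spec:
            problems.append(f"unexpected argument: {name}")
    return problems


def search_wrapper(args: dict[str, Any], backend: Callable[[str, int], list]) -> dict:
    """Validate first, call the backend second, and always return a structured result."""
    problems = validate_args(args, SEARCH_ARGS)
    if problems:
        # Error text is formatted so it can be fed back to the model.
        return {"ok": False, "error": "; ".join(problems)}
    return {"ok": True, "results": backend(args["query"], args.get("max_results", 5))}


def test_valid_args_reach_backend():
    seen = []

    def fake_backend(query, limit):
        seen.append((query, limit))
        return ["doc-1"]

    out = search_wrapper({"query": "refund policy"}, fake_backend)
    assert out == {"ok": True, "results": ["doc-1"]}
    assert seen == [("refund policy", 5)]


def test_bad_args_never_reach_backend():
    def exploding_backend(query, limit):
        raise AssertionError("backend must not be called with invalid arguments")

    out = search_wrapper({"max_results": "ten", "page": 2}, exploding_backend)
    assert out["ok"] is False
    assert "missing required argument: query" in out["error"]
    assert "max_results must be int" in out["error"]
    assert "unexpected argument: page" in out["error"]
```

Both tests run under pytest with no network access and no LLM call, which is what keeps this layer fast enough for every pull request.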

Writing unit tests for these components first has a compounding effect on the quality of the rest of the testing program. When you know that the tool wrappers are correct, the schema validators are correct, and the context management is correct, integration test failures can be attributed to the LLM's behavior rather than to bugs in the surrounding infrastructure. This makes trace-based integration tests and chaos tests significantly easier to debug. Teams that skip unit tests on deterministic components find themselves unable to isolate the source of integration test failures, because any failure could be a tool wrapper bug, a schema bug, a context bug, or an actual model behavior issue.

Section 03 · Layer 02

Layer 2: Trace-based integration tests for agent steps

Trace-based tests run the agent and assert on the sequence of tool calls it made — not just the final output. They catch wrong tool selection, missed steps, and budget violations that output-only tests miss.

[Figure: the four-layer testing model. Unit tests for deterministic components, trace integration tests for step sequences, LLM as judge for output quality, chaos testing for failure paths.]

Trace-based integration tests run the agent against a defined task and then assert on the trace of tool calls it produced, rather than on the content of the final output. A trace assertion might verify that the agent called the search tool before the synthesis tool (ordering constraint), that the agent did not exceed a maximum of eight tool calls (budget constraint), that a specific tool was called at least once for a task type where that tool is required (coverage constraint), or that the agent called a specific tool with arguments matching a defined schema (argument shape constraint). These assertions catch a class of bugs that output-only tests cannot: the agent that gets the right answer by taking the wrong path, and the agent that exceeds cost or latency budgets on common task types.
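The four constraint types above can be sketched as plain assertion helpers over a recorded trace. The `ToolCall` record and the tool names are assumptions made for illustration. A tracing tool would supply its own trace format, which would need to be adapted into something like this.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step in a recorded trace (hypothetical minimal shape)."""
    name: str
    args: dict = field(default_factory=dict)


def assert_called_before(trace, first, second):
    """Ordering constraint: the first call to `first` precedes the first call to `second`."""
    names = [call.name for call in trace]
    assert first in names, f"{first} was never called"
    assert second in names, f"{second} was never called"
    assert names.index(first) < names.index(second), f"{first} must precede {second}"


def assert_within_budget(trace, max_calls):
    """Budget constraint: the run used at most `max_calls` tool calls."""
    assert len(trace) <= max_calls, f"{len(trace)} tool calls exceeds budget of {max_calls}"


def assert_called_at_least_once(trace, tool):
    """Coverage constraint: a required tool appears somewhere in the trace."""
    assert any(call.name == tool for call in trace), f"{tool} was never called"


def assert_arg_shape(trace, tool, required_keys):
    """Argument shape constraint: every call to `tool` carries the required keys."""
    for call in trace:
        if call.name == tool:
            missing = set(required_keys) - set(call.args)
            assert not missing, f"{tool} call is missing arguments: {sorted(missing)}"


# A recorded trace from one hypothetical run of a research task.
example_trace = [
    ToolCall("search", {"query": "agent testing"}),
    ToolCall("fetch_page", {"url": "https://example.com/a"}),
    ToolCall("synthesize", {"sources": ["https://example.com/a"]}),
]


def test_research_task_trace():
    assert_called_before(example_trace, "search", "synthesize")
    assert_within_budget(example_trace, max_calls=8)
    assert_called_at_least_once(example_trace, "search")
    assert_arg_shape(example_trace, "search", {"query"})
```

None of these helpers looks at the final answer, so the same test stays valid across runs that word the output differently.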

LangSmith and DeepEval are two widely used tools for trace capture and assertion in production agentic systems. LangSmith provides automatic trace capture for LangChain-based agents and a UI for inspecting and annotating traces. DeepEval provides a test framework with built-in metrics for tool call correctness, faithfulness, and step sequence validation. For teams not using LangChain, both tools have SDK options for manual trace instrumentation. The investment in trace instrumentation pays off immediately in debuggability: when an integration test fails, you have the full step sequence available for diagnosis rather than just the wrong final output.

Section 04 · Layer 03

Layer 3: LLM as judge evaluation for output quality

LLM as judge uses a separate model to score output quality against reference answers. It is the practical alternative to human annotation for evaluating the semantic correctness of free-form agent responses at scale.

Building a useful golden dataset for LLM as judge evaluation requires 50 to 100 annotated cases that cover the real distribution of tasks the agent is deployed for, not a synthetic or idealized distribution. Start from real traces: collect actual agent runs from a pilot or a limited production deployment, select the ones that represent the full range of task types and difficulty levels, and annotate each with the ideal output and a rubric that explains what makes a response good or bad for that task type. The annotation work is significant but not replaceable — synthetically generated ground truth systematically misses the ambiguities and edge cases that real user requests contain.
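One way to keep the golden dataset honest is to give each case a fixed shape and check the dataset before every evaluation run. The sketch below is an assumption about how that could look. The field names, the `dataset_problems` helper, and the task type labels are hypothetical and not taken from any particular tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    task_type: str        # hypothetical label, e.g. "refund_lookup"
    difficulty: str       # e.g. "easy", "medium", "hard"
    user_input: str       # copied from a real trace, not written synthetically
    ideal_output: str     # written by a human annotator
    rubric: str           # what makes a response good or bad for this task type
    source_trace_id: str  # link back to the trace the case came from


def dataset_problems(cases, required_task_types, min_cases=50):
    """Return a list of gaps in the dataset; an empty list means it is usable."""
    problems = []
    if len(cases) < min_cases:
        problems.append(f"dataset has {len(cases)} cases; expected at least {min_cases}")
    counts = Counter(case.task_type for case in cases)
    for task_type in sorted(required_task_types):
        if counts[task_type] == 0:
            problems.append(f"no cases for task type: {task_type}")
    for case in cases:
        if not case.ideal_output.strip() or not case.rubric.strip():
            problems.append(f"{case.case_id} is missing an ideal output or rubric")
    return problems
```

Running a check like this in CI means a thin or lopsided dataset fails loudly instead of quietly producing flattering scores.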

Use a different model family as the judge. If your agent runs on Claude, evaluate with GPT-4 or Gemini. If your agent runs on GPT-4, evaluate with Claude. Using the same model family to evaluate its own outputs introduces self-reinforcing bias: the model tends to prefer responses that match its own generation style and to penalize responses from other models, regardless of actual quality. Cross-provider evaluation tends to produce more calibrated scores and to surface capability differences between models more reliably.

Score on the dimensions that matter for your specific agent: faithfulness (does the response accurately reflect the information retrieved by the tools?), relevance (does the response address what was actually asked?), completeness (did the agent accomplish all parts of a multi-part task?), and safety (did the response avoid content that should be refused?). Faithfulness is the most important dimension for RAG-based agents, where the primary failure mode is generating responses that are not grounded in the retrieved content. For planning agents, completeness is typically the most diagnostic dimension.
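The scoring step can be sketched as a prompt builder, a strict parser for the judge's reply, and a threshold check per dimension. Everything here is an assumption for illustration: the 1 to 5 scale, the prompt wording, the function names, and the `judge` callable, which stands in for whatever client calls the judge model and returns its text.

```python
import json

DIMENSIONS = ("faithfulness", "relevance", "completeness", "safety")


def build_judge_prompt(task, response, reference, rubric):
    """Assemble the text sent to the judge model (wording is illustrative only)."""
    return (
        "Score the RESPONSE against the REFERENCE using the RUBRIC.\n"
        f"Return JSON with integer scores from 1 to 5 for: {', '.join(DIMENSIONS)}.\n\n"
        f"TASK:\n{task}\n\nRESPONSE:\n{response}\n\n"
        f"REFERENCE:\n{reference}\n\nRUBRIC:\n{rubric}\n"
    )


def parse_scores(raw):
    """Parse the judge's reply strictly; a malformed reply is an error, not a zero."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {dim}: {value!r}")
        scores[dim] = value
    return scores


def judge_case(judge, task, response, reference, rubric):
    """Score one agent response. `judge` is any callable mapping prompt text to reply text."""
    return parse_scores(judge(build_judge_prompt(task, response, reference, rubric)))


def passes(scores, thresholds):
    """True when every dimension named in `thresholds` meets its minimum score."""
    return all(scores[dim] >= minimum for dim, minimum in thresholds.items())


# In a test, the judge can be stubbed so the plumbing is checked without an API call.
def stub_judge(prompt):
    return '{"faithfulness": 5, "relevance": 4, "completeness": 3, "safety": 5}'


scores = judge_case(stub_judge, "Summarize the policy", "draft answer", "ideal answer", "Must cite sources")
rag_gate = passes(scores, {"faithfulness": 4, "safety": 5})
planning_gate = passes(scores, {"completeness": 4})
```

Per-dimension thresholds let a RAG-based agent gate mainly on faithfulness while a planning agent gates on completeness, without changing the judge prompt.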

Section 05 · Layer 04

Layer 4: Chaos testing for failure paths

Chaos tests inject specific failure conditions and verify the agent handles each one correctly. The six failure modes to cover are tool timeout, malformed result, model refusal, context overflow, retry storm, and schema drift.

[Figure: the six chaos testing failure modes (tool timeout, malformed result, model refusal, context overflow, retry storm, schema drift), each with a description of the correct agent behavior.]

Each of the six chaos failure modes requires a specific injection mechanism. Tool timeout is injected by wrapping the tool function in a thin layer that raises a timeout exception after a configurable delay — verify the agent produces a structured error message rather than stalling. Malformed result is injected by returning syntactically invalid JSON or a result with missing required fields — verify the agent validates the result before using it and recovers gracefully. Model refusal is injected by stubbing the LLM client to return a refusal completion for a specific prompt pattern — verify the agent escalates to a human fallback rather than retrying indefinitely. Context overflow is injected by seeding the conversation with a large synthetic history that pushes the agent near the context limit before the test task begins — verify the agent triggers summarization correctly rather than failing with a context length error.
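Two of these injections, tool timeout and malformed result, can be sketched with thin stand-in tools. The `run_tool_step` function below is a hypothetical single agent step written for this example, not part of any framework. The point is the shape of the test: swap in a failing tool, then assert on the structured outcome.

```python
import json
import time


def inject_timeout(delay_s=0.0, sleep=time.sleep):
    """Build a stand-in tool that waits `delay_s` seconds and then raises TimeoutError."""
    def timed_out_tool(*args, **kwargs):
        sleep(delay_s)
        raise TimeoutError(f"injected timeout after {delay_s}s")
    return timed_out_tool


def inject_malformed(raw_result):
    """Build a stand-in tool that always returns the given (broken) raw result."""
    def malformed_tool(*args, **kwargs):
        return raw_result
    return malformed_tool


def run_tool_step(tool, query):
    """Hypothetical agent step: call the tool, validate the result, never raise."""
    try:
        raw = tool(query)
    except TimeoutError as exc:
        return {"status": "error", "kind": "tool_timeout", "detail": str(exc)}
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "error", "kind": "malformed_result", "detail": "result is not valid JSON"}
    if not isinstance(payload, dict) or "results" not in payload:
        return {"status": "error", "kind": "malformed_result", "detail": "missing required field: results"}
    return {"status": "ok", "results": payload["results"]}


def test_timeout_produces_structured_error():
    # The sleep function is replaced so the test does not actually wait.
    tool = inject_timeout(delay_s=30.0, sleep=lambda seconds: None)
    outcome = run_tool_step(tool, "refund policy")
    assert outcome["status"] == "error"
    assert outcome["kind"] == "tool_timeout"


def test_malformed_results_are_rejected():
    truncated = run_tool_step(inject_malformed('{"results": ['), "refund policy")
    missing_field = run_tool_step(inject_malformed('{"items": []}'), "refund policy")
    assert truncated["kind"] == "malformed_result"
    assert missing_field["detail"] == "missing required field: results"
```

Model refusal and context overflow follow the same pattern, with the stub placed on the LLM client or the conversation history instead of the tool.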

Retry storm and schema drift are the two failure modes that require the most careful injection. Retry storm is tested by injecting a persistent tool failure, one that fails on every invocation, and verifying that the agent's retry handler stops at the configured maximum and returns a clear failure message to the user rather than looping indefinitely. Schema drift is tested by registering a tool definition that describes different parameters than the tool implementation actually accepts, then verifying that the agent's argument validation layer catches the mismatch and surfaces a deployment error rather than silently producing wrong behavior. Both of these failures have happened in production at scale, and both can be caught before deployment with explicit chaos test coverage.
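Both checks can be sketched in a few lines. The retry handler below is a deliberately simplified stand-in with backoff omitted. The drift check compares a declared parameter list against the implementation's real signature using the standard library's `inspect.signature`. The `lookup_order` tool and its declared parameters are hypothetical.

```python
import inspect


class RetryBudgetExceeded(Exception):
    """Raised when a tool is still failing after the configured number of attempts."""


def call_with_retry(tool, max_attempts=3):
    """Call `tool` up to `max_attempts` times, then give up with a clear error."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return tool()
        except Exception as exc:
            last_error = exc
    raise RetryBudgetExceeded(f"tool failed {max_attempts} times: {last_error}")


def schema_drift(declared_params, implementation):
    """Return (declared but not accepted, accepted but not declared) parameter names."""
    actual = set(inspect.signature(implementation).parameters)
    declared = set(declared_params)
    return sorted(declared - actual), sorted(actual - declared)


def test_retry_storm_is_capped():
    attempts = []

    def always_down():
        attempts.append(1)
        raise ConnectionError("injected persistent failure")

    try:
        call_with_retry(always_down, max_attempts=3)
    except RetryBudgetExceeded as exc:
        message = str(exc)
    else:
        raise AssertionError("expected the retry handler to give up")
    assert len(attempts) == 3
    assert "3 times" in message


def lookup_order(order_id, include_items=False):
    """Hypothetical tool implementation."""
    return {"order_id": order_id, "items": [] if include_items else None}


def test_schema_drift_is_detected():
    # The registered definition was written for an older version of the tool.
    declared = ["order_id", "customer_email"]
    missing, undeclared = schema_drift(declared, lookup_order)
    assert missing == ["customer_email"]
    assert undeclared == ["include_items"]
```

A drift check like this is cheap enough to run at startup as well as in the chaos suite, so a mismatched tool definition fails the deployment rather than a user request.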

Section 06 · Implementation

Building a practical testing program

A minimal viable testing program for a production agentic system: what to run in CI, what to run pre-deployment, and what to run on a scheduled cadence.

Blocking CI is the first tier: tests that run on every pull request that touches agent logic and block the merge if they fail. This tier includes all unit tests for deterministic components (tool wrappers, validators, context management, retry handlers) and a small set of trace-based integration tests covering the most critical task types with the most catastrophic failure modes — safety cases, data write operations, and tasks where wrong behavior has direct user-facing consequences. This tier should be fast: under three minutes total, which means mocking all LLM API calls and using recorded tool responses rather than live calls.

Blocking deploy is the second tier: tests that run after every merge to the main branch and block the deployment if they fail. This tier includes the full trace-based integration suite against a staging environment with live LLM calls, the LLM as judge evaluation against the golden dataset, and the chaos testing suite. This tier is slower — 15 to 30 minutes depending on the size of the evaluation dataset and the number of chaos scenarios — but it provides the full coverage needed before deploying new agent behavior to users. A failed deploy-gate test should require an explicit human override to bypass, with the failure logged and investigated.

Scheduled cadence is the third tier: the full evaluation suite including boundary cases, regression cases, and performance benchmarks, run nightly or weekly without blocking deployment. This tier is where you detect slow behavioral drift — gradual changes in output quality, step count efficiency, or safety behavior that happen too slowly to trigger a single-run regression but are visible in the trend data over multiple runs. The scheduled suite generates a report that the team reviews on a defined cadence; anomalies in the trend data are investigated before they become user-facing problems.
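Slow drift of this kind can be illustrated with a simple window comparison over the scheduled run history. The window size, the threshold, and the `detect_drift` name are assumptions chosen for the example, not recommended values.

```python
from statistics import mean


def detect_drift(history, window=7, max_drop=0.05):
    """Compare the mean of the oldest `window` runs with the mean of the newest.

    `history` is a list of scores ordered oldest to newest, e.g. the nightly
    pass rate on the golden dataset. Returns None when there is too little data.
    """
    if len(history) < 2 * window:
        return None
    baseline = mean(history[:window])
    recent = mean(history[-window:])
    drop = baseline - recent
    return {"baseline": baseline, "recent": recent, "drop": drop, "drifted": drop > max_drop}


# Twenty nightly runs, each only half a point worse than the night before.
nightly_pass_rate = [0.90 - 0.005 * night for night in range(20)]
largest_single_step = max(
    earlier - later for earlier, later in zip(nightly_pass_rate, nightly_pass_rate[1:])
)
report = detect_drift(nightly_pass_rate, window=7, max_drop=0.05)
```

In this series no single night moves by more than about half a point, so a per-run regression gate set at a couple of points would never fire, yet the windowed comparison flags the cumulative decline.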

Section 07 · FAQ

Frequently asked questions

The questions engineering teams ask most when building their first agentic AI testing program.

How do you test an AI agent?

Use a four-layer model: unit tests for deterministic components like tool wrappers and context managers, trace-based integration tests that assert tool call order and budget, LLM as judge evaluation against a golden dataset for output quality, and chaos testing to verify failure path handling. Each layer catches a different class of bug. Relying on only one layer leaves significant gaps in your coverage.

How do you test non-deterministic AI systems?

You cannot assert exact outputs from non-deterministic systems. Instead, test the deterministic parts (tool wrappers, context management logic) with standard unit tests; test the step sequence with trace-based integration tests that assert call order and budget rather than exact content; and evaluate output quality probabilistically using LLM as judge evaluation against a golden dataset of annotated reference cases.

What is LLM as judge evaluation?

LLM as judge evaluation uses a separate language model to score the output of your agent against reference answers or quality rubrics. It is the practical alternative to human annotation for evaluating free-form responses at scale. Using a different model family as the evaluator reduces self-reinforcing bias. Common scoring dimensions are faithfulness, relevance, completeness, and safety.

What tools exist for testing AI agents in production?

LangSmith and DeepEval are widely used for trace capture, trace-based integration testing, and LLM as judge evaluation. Standard pytest works for unit tests on deterministic components. For chaos testing, you write thin wrapper functions that inject specific failure conditions (tool timeouts, malformed JSON, context overflow) and assert on agent recovery behavior.

How do you write unit tests for LLM applications?

Focus unit tests on the deterministic parts: tool schema validation, parameter parsing, context window management logic, retry handler behavior, and any pure functions in your agent code. Mock the LLM API call entirely in unit tests — you are testing your code, not the model. Integration tests handle the combined behavior. This keeps unit tests fast, deterministic, and runnable in CI without API calls.
