LLM Observability: Metrics, Tools and Production Setup

Key takeaways

Standard application monitoring tools (Datadog APM, New Relic, CloudWatch) do not capture what matters for LLM systems: which context was retrieved, which model version was called, what the token count was, and whether the output was faithful to the source.
LLM observability requires four metric categories: traces and spans per LLM call, token usage and cost per query, latency at the 50th and 95th percentile, and output quality metrics like faithfulness, groundedness, or a judge model score.
The main tools in the space: Arize Phoenix (open source, strong evaluation integration), Langfuse (open source, LangChain native), Helicone (proxy based, zero code setup), Datadog LLM Observability (managed, enterprise compliance focus).
Wire observability from day one, not after something breaks. The debugging cost of uninstrumented production LLM systems is very high.
For regulated industries (healthcare, finance, legal), observability is not optional. It is the audit trail that demonstrates your AI system behaved as designed.

Section 01 · Why APM falls short

Why standard APM tools fall short for LLM systems

If you have instrumented a production web application with Datadog, New Relic, or CloudWatch, you know what these tools are good at: request latency, error rates, CPU and memory, database query times. They give you a picture of whether your system is up and whether it is slow.

Quick answer

In one sentence: LLM observability is the practice of instrumenting a production AI system to capture traces, token usage, latency, and output quality metrics at the LLM call level, giving you the visibility to debug failures, control costs, detect quality degradation, and satisfy compliance requirements that standard application monitoring tools cannot address.

Standard APM tools tell you almost nothing useful about an LLM system. When an LLM call fails or produces a bad output, the questions you need to answer are: what prompt was sent to the model? What context was retrieved and included? Which model version was called? How many tokens were consumed, and what was the cost? Did the output faithfully reflect the retrieved context, or did the model hallucinate? What was the latency at the 95th percentile across the last 1,000 queries?

Standard APM tools capture none of this. They see an HTTP request to an LLM API endpoint and a response. The content, the thing that determines whether your system is working correctly, is opaque to them.

This is why LLM observability has emerged as its own category. It is not a replacement for APM; you still need to know if your servers are up. It is a parallel layer that captures the LLM specific signals that determine whether your AI system is doing what it was designed to do.

The AI governance guide for production LLM systems covers how observability fits into a broader compliance and audit framework, which is particularly relevant for regulated industries.

Section 02 · Metrics

The four metric categories that matter

Not all LLM metrics are equally useful. After instrumenting a number of production systems, four categories consistently determine whether an observability setup is actionable.

Traces and spans

A trace is the full record of one user request as it flows through your system: the initial query, the retrieval step, the prompt construction, the model call, and the final response. Each component within the trace is a span with its own timing and payload. Traces let you see exactly what happened during a specific request, which is essential when a user reports a bad output and you need to reconstruct the path that produced it.
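
To make the trace and span vocabulary concrete, here is a minimal sketch using only the Python standard library. It is not how any particular tool stores traces; the `Trace`, `Span`, and `handle_request` names, the step names, and the placeholder payloads are all assumptions made for illustration, and no real retriever or model is called.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """One step inside a trace, with its own timing and payload."""
    name: str
    trace_id: str
    payload: Dict[str, Any] = field(default_factory=dict)
    duration_ms: Optional[float] = None


class Trace:
    """Collects the spans produced while handling one user request."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []

    @contextmanager
    def span(self, name: str, **payload: Any):
        record = Span(name=name, trace_id=self.trace_id, payload=dict(payload))
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record timing even if the step raised, so failures stay visible.
            record.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(record)


def handle_request(query: str) -> Trace:
    """Hypothetical request handler: retrieval, prompt construction, model call."""
    trace = Trace()
    with trace.span("retrieval", query=query) as s:
        s.payload["documents"] = ["doc-1", "doc-2"]  # placeholder retrieval result
    with trace.span("prompt_construction") as s:
        s.payload["prompt"] = f"Answer using doc-1, doc-2: {query}"
    with trace.span("model_call", model="example-model") as s:
        s.payload["response"] = "placeholder response"  # no real model is called
    return trace
```

Given a trace ID from a user report, you can then list `trace.spans` in order and inspect the payload of each step. Production tools add span nesting, sampling, and export on top of this basic shape.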

Token usage and cost per query

LLM APIs charge by the token. Without per query token tracking, you cannot forecast cost, identify which users or query patterns are expensive, or set budget alerts before an unexpected spike hits your invoice. At the same per token rate, a query that uses 10,000 tokens costs 50 times as much as a query that uses 200 tokens. You want to know which queries are in which category.
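
A rough sketch of per query cost tracking follows. The two prices are hypothetical placeholders, not any provider's actual rates, and `QueryUsage`, `most_expensive`, and `over_budget` are illustrative names. Input and output tokens are priced separately because most providers bill them at different rates.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K_INPUT = 0.0025
PRICE_PER_1K_OUTPUT = 0.01


@dataclass
class QueryUsage:
    query_id: str
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        input_cost = self.input_tokens / 1000 * PRICE_PER_1K_INPUT
        output_cost = self.output_tokens / 1000 * PRICE_PER_1K_OUTPUT
        return input_cost + output_cost


def most_expensive(usages: List[QueryUsage], n: int = 5) -> List[QueryUsage]:
    """Return the n costliest queries, most expensive first."""
    return sorted(usages, key=lambda u: u.cost_usd, reverse=True)[:n]


def over_budget(usages: List[QueryUsage], budget_usd: float) -> Tuple[float, bool]:
    """Return total spend and whether it exceeds the budget."""
    total = sum(u.cost_usd for u in usages)
    return total, total > budget_usd
```

With the same input to output mix, a 10,000 token query (say 7,500 in and 2,500 out) costs 50 times as much as a 200 token query (150 in and 50 out), which is the ratio described above. Running `over_budget` on a rolling window of usage records is one simple way to drive a budget alert.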

Latency at the 50th and 95th percentile

The average latency of your LLM calls is almost meaningless on its own, because it blends fast and slow requests into one number and hides the tail. The 95th percentile is the number that determines user experience under load. A system with 800ms median latency and 8,000ms p95 latency is one that roughly one request in twenty experiences as broken. Track p50 and p95 separately, and set alerts on p95 rather than average.
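
As an illustration, the sketch below computes p50 and p95 with the nearest rank method and alerts on p95. The `percentile` and `p95_alert` helpers and the 3,000ms threshold are assumptions for this example; observability tools usually compute percentiles for you, often with an interpolating method that gives slightly different values.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() requires at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def p95_alert(latencies_ms: Sequence[float], threshold_ms: float) -> bool:
    """Alert when the 95th percentile latency exceeds the threshold."""
    return percentile(latencies_ms, 95) > threshold_ms


# Illustrative sample: 90 fast requests and 10 slow ones.
latencies_ms = [800.0] * 90 + [8000.0] * 10
mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
```

For this sample the mean is 1,520ms, the p50 is 800ms, and the p95 is 8,000ms. The mean sits far from both the typical request and the slow tail, which is why the alert is set on p95.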

Output quality metrics

This is the hardest category and the one that most teams skip. Output quality metrics include faithfulness (does the output accurately reflect the retrieved context?), groundedness (is every claim in the output supported by source documents?), and coherence (is the output semantically sensible?). These require either a judge model call (using a second LLM to evaluate the primary one) or a reference dataset for comparison.
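
The judge model approach can be sketched as follows. This is a simplified illustration, not a validated evaluator: the prompt wording, the 1 to 5 scale, and the `faithfulness_score` function are assumptions, and `judge` is a hypothetical callable that you would wire to whichever model client you use.

```python
import re
from typing import Callable

# Hypothetical judge prompt; real evaluation prompts need tuning and validation.
JUDGE_PROMPT = (
    "You are grading an answer for faithfulness to the provided context.\n"
    "Context:\n{context}\n\n"
    "Answer:\n{answer}\n\n"
    "Reply with a single integer from 1 (contradicts or invents facts) "
    "to 5 (fully supported by the context)."
)


def faithfulness_score(context: str, answer: str, judge: Callable[[str], str]) -> float:
    """Ask a judge model for a 1-5 rating and map it onto the range 0.0-1.0."""
    prompt = JUDGE_PROMPT.format(context=context, answer=answer)
    reply = judge(prompt)
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        # Surface unparseable judge replies instead of silently scoring them.
        raise ValueError(f"Could not parse a 1-5 score from judge reply: {reply!r}")
    return (int(match.group(1)) - 1) / 4


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call, used only to show the flow."""
    return "Score: 4"


score = faithfulness_score(
    context="Refunds are processed within 5 business days.",
    answer="Refunds take 5 business days.",
    judge=stub_judge,
)
```

Passing the judge in as a callable keeps the scoring logic testable without network access and lets you swap judge models without touching the parsing code.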

The LLM agent evaluation framework covers how to build evaluation pipelines that produce these metrics at scale.

Section 03 · Tools

Tool landscape: the four main options

The LLM observability tooling space is moving fast, but four tools cover the majority of production use cases.

Arize Phoenix

Open source, runs locally or on your infrastructure, and has the tightest integration with evaluation workflows. Phoenix captures traces, supports evaluation datasets, and lets you run judge model evaluations against historical traces to measure quality over time. Strong choice if you want full data control and a deep evaluation integration. Works with LangChain, LlamaIndex, and raw OpenAI or Anthropic API calls via OpenTelemetry.

Langfuse

Open source (self hosted or cloud), LangChain native via a callback handler, and very fast to set up. Langfuse captures traces and spans automatically when you add the callback to your LangChain chains. It supports user feedback collection (thumbs up or down from the UI), which can seed your evaluation dataset. Strong choice for teams already in the LangChain ecosystem who want to be instrumented in under an hour.

Helicone

Proxy based: you route your LLM API calls through Helicone's proxy instead of calling the model API directly. No application instrumentation is required if you are calling the API directly: the change is a base URL swap (often a single environment variable) plus an auth header. Captures cost, latency, and token metrics automatically. Weaker on evaluation and trace level debugging than Phoenix or Langfuse. Strong choice for fast setup with minimal engineering effort, especially if you have a mix of model providers.

Datadog LLM Observability

Managed, enterprise compliance focus, integrates with Datadog's existing APM and alerting infrastructure. Strong choice if you are already a Datadog shop and want LLM observability in the same platform as your infrastructure monitoring. More expensive than the open source options. Has the strongest compliance and audit log story for regulated industries.

Four main LLM observability tools compared on delivery model, setup effort, evaluation depth, and compliance story.
Tool	Delivery model	Setup effort	Evaluation depth	Compliance story
Arize Phoenix	Open source	Medium	High	Self hosted
Langfuse	Open source	Low (LangChain)	Medium	Self hosted or cloud
Helicone	SaaS	Very low	Low	Cloud
Datadog LLM Obs	SaaS	Medium	Medium	Enterprise managed

Section 04 · Implementation

How to wire observability into a LangChain or LangGraph stack

The mechanics depend on which tool you choose, but the pattern is similar across all of them: instrument at the chain or graph level, not at the individual model call level.

With Langfuse and LangChain. Initialize a callback handler with your public and secret keys, then pass it into the chain invocation as part of the callbacks config. The callback handler intercepts every LangChain event (model calls, retrieval steps, tool invocations) and logs it to Langfuse. No other code changes are required.

With Arize Phoenix and LangGraph. Phoenix uses OpenTelemetry. You register a tracer provider once at application startup with a project name and the Phoenix collector endpoint, and enable the OpenInference LangChain instrumentor, which also covers LangGraph. From then on, graph runs are traced without further code changes, and every node execution becomes a span in the trace.

With Helicone. Replace the base URL in your OpenAI or Anthropic client initialization to point at the Helicone proxy endpoint, and pass your Helicone auth key in the default headers. All subsequent API calls route through Helicone's proxy and appear in the dashboard automatically.
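
A sketch of the Helicone swap for an OpenAI client is shown below. The gateway URL and the `Helicone-Auth` header follow Helicone's published OpenAI integration at the time of writing, but check the current Helicone documentation before relying on them; other providers use different gateway hosts. The `helicone_client_kwargs` helper is a hypothetical convenience, not part of any SDK.

```python
from typing import Any, Dict

# Helicone's OpenAI gateway endpoint (verify against current Helicone docs).
HELICONE_OPENAI_BASE_URL = "https://oai.helicone.ai/v1"


def helicone_client_kwargs(openai_api_key: str, helicone_api_key: str) -> Dict[str, Any]:
    """Build constructor arguments that route an OpenAI client through Helicone."""
    return {
        "api_key": openai_api_key,
        "base_url": HELICONE_OPENAI_BASE_URL,
        # Helicone authenticates the proxy hop with its own key in this header.
        "default_headers": {"Helicone-Auth": f"Bearer {helicone_api_key}"},
    }


# Placeholder keys; in practice read both from environment variables.
kwargs = helicone_client_kwargs("sk-placeholder", "hk-placeholder")
```

With the official OpenAI Python SDK installed, the client would then be created as `OpenAI(**kwargs)`. Nothing else in the application changes, which is the appeal of the proxy approach.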

The important point: instrument at the start of the project, not after something goes wrong. Adding instrumentation at the beginning is cheap. Retrofitting it onto a production LLM system while you are debugging a live issue is slow and frustrating, and it cannot recover the traces of requests that have already failed.

Section 05 · Compliance

Regulated industries and compliance

For healthcare, finance, and legal applications, LLM observability serves two functions that go beyond engineering convenience: it is the audit trail that demonstrates your AI system behaved as designed, and it is the detection layer that identifies compliance violations before they scale.

In practice this means: every LLM call must be logged with the full prompt, the retrieved context, the model response, and a timestamp. The log must be tamper evident, meaning any change made after a record is written can be detected, and it should be stored append only and retained for the regulatory period applicable to your industry. Any output that triggered a human review should link the review decision back to the original LLM trace.
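
One common way to make a log tamper evident is a hash chain, where each record stores the hash of the record before it. The sketch below shows the idea with the standard library only. It is an illustration, not a compliance grade implementation: the field names, the optional `review_id` link, and the in-memory list are assumptions, and a real system would also need durable append only storage and access controls.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

GENESIS_HASH = "0" * 64  # prev_hash value for the first record in the chain


def _digest(body: Dict[str, Any]) -> str:
    """Hash a record body deterministically (sorted keys, UTF-8)."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_record(
    log: List[Dict[str, Any]],
    prompt: str,
    context: List[str],
    response: str,
    review_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Append one LLM call to the log, chained to the previous record's hash."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "context": context,
        "response": response,
        "review_id": review_id,  # hypothetical link to a human review decision
        "prev_hash": log[-1]["hash"] if log else GENESIS_HASH,
    }
    record = dict(body, hash=_digest(body))
    log.append(record)
    return record


def verify_chain(log: List[Dict[str, Any]]) -> bool:
    """Return False if any record was altered, removed, or reordered."""
    prev = GENESIS_HASH
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != record["hash"]:
            return False
        prev = record["hash"]
    return True
```

Editing any stored field changes that record's hash and breaks the link from the next record, so `verify_chain` fails. Detection still depends on keeping the latest hash somewhere an attacker cannot rewrite along with the log.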

Self hosted tools (Phoenix, Langfuse) give you full control over where logs are stored and who has access. Datadog LLM Observability's enterprise tier has SOC 2 and HIPAA compliance documentation if your organization requires managed infrastructure with formal compliance certifications.

If you are building for a regulated industry and are not sure how observability fits into your broader AI governance framework, the AI systems architecture service covers compliance aware system design as a first class concern.

Section 06 · FAQ

Frequently asked questions

The questions engineering leads ask most before instrumenting a production LLM system.

What is LLM observability?

LLM observability is the practice of instrumenting a production AI system to capture traces, token usage, latency, and output quality metrics at the LLM call level. It gives you visibility into what your AI system is doing, why it behaves the way it does, and when its behavior degrades. Standard application monitoring tools cannot provide that visibility.

What is the difference between LLM observability and standard application monitoring?

Standard application monitoring captures infrastructure metrics: latency, error rates, memory, CPU. LLM observability captures AI specific signals: which prompt was sent, which context was retrieved, how many tokens were consumed, what the model responded, and whether the response was accurate. The two layers are complementary: you need both in production.

Which LLM observability tool should I use?

For teams in the LangChain ecosystem who want fast setup: Langfuse. For teams who need deep evaluation integration and full data control: Arize Phoenix. For teams who want zero code changes: Helicone. For enterprise teams already using Datadog: Datadog LLM Observability. Match the tool to your existing infrastructure and compliance requirements.

Do I need LLM observability for a prototype?

No. Prototypes do not need production grade instrumentation. But if you ship a prototype to real users without instrumentation, you lose the ability to debug any issues users report, and you have no baseline to compare against when you add observability later. Adding a Langfuse callback handler to a LangChain prototype takes under 10 minutes and saves significant debugging time.

How do I measure output quality for LLM systems?

Output quality measurement requires either a judge model (a second LLM call that evaluates the primary output for faithfulness, groundedness, or coherence) or a reference dataset of known good responses for comparison. The judge model approach scales automatically; the reference dataset approach is more reliable but requires ongoing maintenance. Most production teams use both: reference datasets for regression testing, judge models for real time quality signals.
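
For the reference dataset side, a regression check can be as simple as the sketch below. Token level F1 is used here as a stand-in similarity measure; it is a crude proxy chosen for brevity, not a recommendation, and the `regression_check` helper, the 0.8 threshold, and the stub generator are assumptions for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(candidate: str, reference: str) -> float:
    """Token overlap F1 between a generated answer and a known good answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def regression_check(
    generate: Callable[[str], str],
    dataset: List[Tuple[str, str]],
    min_mean_f1: float = 0.8,
) -> Tuple[float, bool]:
    """Score every (question, reference) pair and report whether the mean passes."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    scores = [token_f1(generate(question), reference) for question, reference in dataset]
    mean = sum(scores) / len(scores)
    return mean, mean >= min_mean_f1


# Hypothetical reference set and a stub standing in for the real pipeline.
dataset = [("How long do refunds take?", "refunds take 5 business days")]
mean_f1, passed = regression_check(lambda q: "refunds take 5 days", dataset)
```

Run a check like this in CI whenever the prompt, retriever, or model version changes, and keep judge model scoring for live traffic where no reference answer exists.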