Quick answer
How do I reduce LLM inference costs in production? Apply five levers in order: route tasks by complexity so simpler work goes to cheaper models; add semantic caching for repeated queries; implement rolling summarization in agent chains to prevent context accumulation; audit and compress system prompts; and use batch inference for non-latency-sensitive work. Together these consistently produce 60 to 80 percent cost reductions without quality tradeoffs visible to users.
Section 01 · Framing
Why agentic systems have a different cost problem
A simple chatbot has one token cost: input plus output per turn. An agentic loop has three: the initial request, every tool call result appended to context, and every reasoning step. Costs compound differently.
A single-turn LLM application has a predictable cost structure: you send a prompt, you receive a completion, and you pay for the tokens in both. Costs scale linearly with request volume. An agentic system breaks this model. A single user request can trigger five to ten LLM calls as the agent reasons through a problem, calls tools, processes results, and generates its response. Each tool call result is appended to the conversation context, which means the input token count on each subsequent call is larger than the one before it. By the end of a moderately complex agentic chain, you may be sending 20,000 to 50,000 input tokens for a task that started with a 200-token user message.
This compounding structure means that cost optimization strategies that work for simple LLM applications are insufficient for agentic ones. Prompt compression helps, but it addresses only the initial system prompt. Cheaper models help, but they may not handle the multi-step reasoning required in agentic tasks. The cost problem in agentic systems requires a different set of interventions, applied at different points in the chain. The five levers in this post address the agentic cost structure specifically — they are not general LLM optimization advice repurposed.
Section 02 · Lever 01
Lever 1: Model routing by task complexity
Routing by task complexity is the first lever to implement and the highest leverage. It cuts 25 to 40 percent of spend without changing any user-facing behavior.
Not all tasks in an agentic system require a frontier model. Classification tasks — determining which tool to call, categorizing an input, routing a request to the right handler — are handled reliably by smaller, cheaper models. Structured extraction tasks — pulling named entities from a document, parsing a date range from a natural language query, converting a user request into a structured parameter object — also route well to smaller models. These tasks are well-defined, have clear correct answers, and do not require the broad knowledge or multi-step reasoning that frontier models excel at.
Synthesis tasks and planning tasks are the right workload for frontier models. Generating a coherent response that integrates multiple data sources, drafting a structured document, reasoning through a multi-step problem with ambiguous intermediate states — these tasks benefit from the capability gap that frontier models provide. A well-implemented routing layer sends each subtask in the agentic chain to the cheapest model that can handle it reliably, reserving the frontier model for the work where it genuinely matters.
The routing layer itself is a lightweight classifier that examines the incoming task type and the current agent state. It does not need to be complex — a small classification model or even a rules-based router covering the most common task categories is sufficient for most systems. Implement it early: the routing infrastructure pays dividends immediately, and it becomes easier to extend as you identify more task categories that route well to cheaper models. See the LLM function calling guide for how model routing interacts with tool selection in agentic loops.
Section 03 · Lever 02
Lever 2: Semantic caching for repeated queries
Semantic caching stores responses to previous queries and returns them for semantically similar new queries without calling the API. FAQ-style and analytical patterns see the highest hit rates.
Semantic caching works by storing a vector embedding of each previous query alongside its response. When a new query arrives, the system computes its embedding and runs a similarity search against the cache. If the new query is semantically close enough to a previous one — typically using a cosine similarity threshold of 0.92 to 0.96 depending on the application's tolerance for approximate responses — the cached response is returned directly without an API call. The threshold calibration is important: too high and you miss valid cache hits; too low and you return semantically close but contextually wrong cached responses.
The use cases with the highest cache hit rates are FAQ-style interactions — where a small number of question types cover the majority of user queries — and analytical or reporting queries on stable datasets, where the same question about the same data returns the same answer regardless of when it is asked. Production deployments in these patterns typically see 25 to 45 percent cache hit rates, translating to 10 to 30 percent savings on total API spend. Cache TTL management matters: responses cached from stale data or superseded information need expiration policies, or they erode user trust faster than they save money.
Section 04 · Lever 03
Lever 3: Context window hygiene in agent chains
In agentic loops, every tool call result is appended to the conversation. Without active management, context accumulates hundreds of thousands of tokens of intermediate state the model rarely revisits.
Rolling summarization is the standard technique for context window hygiene in agentic systems. After a defined number of tool call cycles — typically three to five — the system uses an LLM call to compress the accumulated intermediate results into a summary, replaces the raw intermediate content with the summary, and continues the chain from the compressed state. The model continues reasoning from the summary rather than the full intermediate history. This reduces context size by 60 to 70 percent per summarization pass while preserving the information the model needs for subsequent reasoning steps.
The summarization call itself costs tokens, but the net economics are strongly positive: a summarization pass that costs 1,500 tokens and removes 15,000 tokens of intermediate state from all subsequent calls saves 13,500 tokens on every following call in the chain. For chains with five or more subsequent LLM calls after the summarization point, the savings compound significantly. Implement rolling summarization with a configurable window size so you can tune the tradeoff between summarization overhead and context compression for different agent types and task lengths.
Section 05 · Lever 04
Lever 4: Prompt compression
System prompts accumulate redundancy over time. Auditing and removing duplicate instructions, obsolete examples, and verbose formatting cuts input tokens by 10 to 20 percent on every request.
System prompts in production agentic systems tend to grow over time as the team adds instructions to fix edge cases, appends examples for new task types, and documents behavioral constraints. Most prompts that have been in production for more than six months contain significant redundancy: instructions that appear in multiple places with slightly different phrasing, examples that address problems the system no longer encounters, and formatting guidance that is more verbose than necessary. A structured audit — reading the system prompt line by line and asking whether each instruction is still necessary and whether it already appears elsewhere — typically identifies 10 to 20 percent of tokens as removable without any behavioral change.
Prompt compression techniques go beyond manual auditing. LLMLingua and similar prompt compression tools use smaller models to identify and remove low-information tokens from prompts while preserving the core semantic content. These tools achieve 2x to 4x prompt compression with minimal task performance degradation on well-tested prompts. The appropriate level of compression depends on the task sensitivity: prompts for safety-critical decisions warrant conservative compression with thorough regression testing; prompts for low-stakes classification tasks can tolerate more aggressive compression with spot-check validation.
Section 06 · Lever 05
Lever 5: Batch inference for non-latency-sensitive work
Batch inference APIs process requests asynchronously, typically at 40 to 60 percent lower cost than synchronous APIs. The tradeoff is latency — results arrive in minutes to hours, not milliseconds.
Both Anthropic and OpenAI offer batch inference APIs that process large volumes of requests at significantly reduced per-token pricing — typically 40 to 60 percent below the synchronous API rate. The tradeoff is latency: batch requests are processed asynchronously and results may arrive anywhere from a few minutes to a few hours after submission, depending on queue depth and request volume.
The eligible workloads are those where the result does not need to be returned to a user in real time. Nightly report generation, document classification pipelines, data enrichment workflows, embedding generation for new content, and evaluation runs are all well-suited to batch inference. For a company with a $20,000 per month synchronous API bill, identifying even 20 to 30 percent of requests that are eligible for batch processing and migrating them reduces the bill by $4,000 to $6,000 per month. Combined with the other four levers, batch inference is the final push that brings the total reduction into the 60 to 80 percent range.
Section 07 · Combined Impact
The cost waterfall: applying levers in order
Each lever compounds with the others. Applied in the right order, the five levers consistently reduce spend by 60 to 80 percent without quality loss.
The ordering matters. Model routing comes first because it reduces the baseline spend on every subsequent operation — fewer frontier model calls means the savings from caching and context hygiene apply to a lower starting cost. Semantic caching comes second because it eliminates entire LLM calls, which means context hygiene, prompt compression, and batch migration only need to be applied to the uncached workload. Context hygiene and prompt compression come next because they reduce the per-call token cost for all remaining synchronous calls. Batch inference comes last, migrating the remaining eligible workload to the lowest-cost tier. Applied in this sequence, the levers are complementary and their effects compound cleanly.
Section 08 · Anti-Patterns
What not to do
Three anti-patterns that look like cost optimizations but create quality or reliability problems: cheap model for everything, aggressive caching without TTL, and cutting system prompt instructions.
Routing every task to the cheapest model is the most common anti-pattern. Teams see the cost savings from routing classification tasks to small models and generalize the approach to the entire workload. Synthesis tasks, multi-step reasoning, and nuanced judgment calls routed to small models produce lower quality outputs that erode user trust faster than the cost savings justify. The routing decision needs to be task-specific.
Aggressive semantic caching without TTL management creates a different class of problem. Cached responses become stale when the underlying data, policies, or product behavior changes. A cache hit on a stale response returns confidently wrong information — worse than no cache at all. Every cached response needs an expiration policy tied to the expected update frequency of the information it contains. Responses about time-invariant facts can have long TTLs; responses about current system state, prices, or user-specific data need short ones.
Cutting system prompt instructions to reduce tokens without regression testing is the third anti-pattern. System prompts encode behavioral constraints, safety guidelines, and task-specific instructions that were added for reasons that may not be obvious from reading the prompt. Removing instructions that look redundant without testing the effect on output quality removes those constraints silently. Any prompt compression beyond cosmetic whitespace removal should be followed by a regression run against a representative evaluation set before deploying to production.
Section 09 · FAQ
Frequently asked questions
The questions engineering teams ask most when tackling LLM inference cost in production.
How do I reduce my OpenAI or Anthropic API bill?
Apply five levers in order: route tasks by complexity to cheaper models, add semantic caching for repeated queries, implement rolling summarization in agent chains, audit and compress system prompts, and use batch inference for non-latency-sensitive work. Together these consistently produce 60 to 80 percent cost reductions without quality tradeoffs visible to users.
What is semantic caching for LLMs?
Semantic caching stores responses to previous queries and returns them for semantically similar new queries without calling the API. A vector similarity lookup against a cache of previous request-response pairs determines whether a cached response is close enough to return. FAQ-style interactions and repeated analytical queries typically see 25 to 45 percent cache hit rates, saving 10 to 30 percent of total API spend.
How do I optimize LLM costs in an agentic system?
Focus on context window hygiene first. In agentic loops, every tool call result is appended to the conversation, and without active management, context accumulates hundreds of thousands of tokens the model rarely revisits. Rolling summarization of intermediate results reduces context size by 60 to 70 percent per chain. Then add model routing so simpler subtasks go to cheaper models, and batch inference for any non-latency-sensitive operations.
What is model routing and how does it reduce LLM costs?
Model routing classifies each incoming request by complexity and routes it to the most cost-efficient model that can handle it. Classification, extraction, and structured output tasks typically route to smaller, cheaper models. Synthesis, planning, and multi-step reasoning tasks route to frontier models. Well-implemented routing cuts 25 to 40 percent of total API spend without visible quality loss to users.
What is Anthropic prompt caching?
Anthropic prompt caching stores the KV cache for a specific prefix of a request — typically a long system prompt or a large document — so subsequent requests that share the same prefix do not reprocess those tokens. It reduces input token costs for applications where a large static context is reused across many requests, such as document question-answering or RAG with a fixed knowledge base.
If you are running a production agentic system and need a structured cost audit or architecture review to identify and implement these levers, the agentic AI consulting service covers cost optimization as part of its architecture and production readiness work.