AI Architecture - Cost Engineering

LLM Pipeline Cost Calculator

Forecast LLM cost for a single OpenAI, Claude, or Gemini call, a multi node pipeline, an AI agent loop, or a multi agent crew before your bill surprises you.

Author: Mudassir Khan. Last updated May 23, 2026.

[Illustration: schematic of the tool workflow from inputs through calculation to recommendation. Labels: inputs, model, answer.]

Monthly total

$2,666

4 model calls per request

Cost per request

$0.0267

Anthropic Claude Sonnet 4.6

Top cost driver

output tokens

Input 32% · Output 68%

  • Formula: per call cost × calls per request × volume, with a retry multiplier, a cache discount on input, and 50% off if batch is on.
  • Switch model to GPT-5 nano: monthly cost drops to $63 (saves about 98%). Validate quality with a small eval set first.
  • Raise the prompt cache hit rate. Stable system prompts and retrieved context that repeat across requests can lift the hit rate to 40-70%, cutting input cost by close to that fraction because cached tokens are billed at a small share of the base rate.
  • Move offline workloads to the Batch API. 50% off list price on every supported model, with no quality change for jobs that tolerate a delay.
  • Pricing verified 2026-05-23 from https://www.anthropic.com/news/claude-sonnet-4-6.
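
The formula in the first bullet can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's own code: the function name, the parameter names, the token counts, the (1 + retry rate) form of the retry multiplier, and the 10% cached input rate are all assumptions. The $3 and $15 per million prices follow the Sonnet 4.6 input rate and the roughly five times output ratio quoted on this page.

```python
def monthly_cost(
    requests_per_month: int,
    calls_per_request: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    retry_rate: float = 0.0,
    cache_hit_rate: float = 0.0,
    cached_price_ratio: float = 0.10,  # assumption: cached input billed at 10% of base
    batch: bool = False,
) -> float:
    """Estimated monthly cost in USD for a fixed number of calls per request."""
    # The cache discount applies to input tokens only.
    input_multiplier = (1 - cache_hit_rate) + cache_hit_rate * cached_price_ratio
    input_cost = input_tokens / 1e6 * input_price_per_m * input_multiplier
    output_cost = output_tokens / 1e6 * output_price_per_m
    per_call = input_cost + output_cost

    # Assumption: each retry repeats a full call, so retries scale cost linearly.
    per_request = per_call * calls_per_request * (1 + retry_rate)
    if batch:
        per_request *= 0.5  # 50% off when the workload runs through a batch API
    return per_request * requests_per_month


# Hypothetical workload: 100,000 requests, 4 calls each,
# 1,500 input and 400 output tokens per call.
live = monthly_cost(100_000, 4, 1_500, 400, 3.0, 15.0,
                    retry_rate=0.08, cache_hit_rate=0.30)
batched = monthly_cost(100_000, 4, 1_500, 400, 3.0, 15.0,
                       retry_rate=0.08, cache_hit_rate=0.30, batch=True)
print(f"live:    ${live:,.2f}")
print(f"batched: ${batched:,.2f}")
```

With these made-up token counts the live workload comes to about $4,011 a month and the batched one to half of that. The figures illustrate the shape of the formula and are not meant to reproduce the totals shown above.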

Direct answer

An LLM cost calculator turns a vendor's per token price into a real monthly bill by multiplying it across every node, branch, retry, tool call iteration, and turn over turn context growth in your workflow. A $0.002 model can drive a $40 user action once the full trace is in front of you, which is why a production OpenAI, Claude, or Gemini deployment almost never matches the headline price.

Customer support agent loop

Input: 10,000 monthly sessions, 4 turns per session, 3 tool calls per turn, 20% context carry forward, 8% retry rate, 30% prompt cache hit rate, Claude Haiku 4.5.

Output: Total monthly cost, cost per session, cost per user per month, and which lever moves the bill most when you flip it.

How to use this tool

  1. Pick a workload type and template.
  2. Set monthly request or session volume.
  3. Adjust retry, cache, batch, and agent loop settings.
  4. Review cost drivers and optimisation suggestions.

OpenAI API cost, Claude API cost, and Gemini per token in one place

OpenAI API cost ranges from $0.05 per million input tokens on GPT-5 nano to $5 per million input tokens on the GPT-5.5 reasoning tier. Claude API cost sits at $1 per million input tokens for Haiku 4.5, $3 for Sonnet 4.6, and $5 for Opus 4.7, with output tokens billed at roughly five times the input rate. Gemini API cost runs from $0.10 per million input tokens for Flash Lite to $1.25 for Pro. The calculator carries the full table for all three providers and lets you switch model per node, so you can see what a swap between OpenAI, Claude, and Gemini actually costs.

Reasoning or thinking tokens from o series, GPT-5 thinking, Claude extended thinking, and Gemini thinking are billed at the output rate and can dominate the bill if they are left uncapped. The calculator has a dedicated reasoning tokens field so the forecast never silently undercounts them. When provider prices change, edit the pricing table in code and every cost figure on the page updates with it.
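
A minimal sketch of how reasoning tokens enter the per call cost, assuming a small pricing table keyed by model. The prices are the GPT-5 and GPT-5 nano rates quoted on this page; the dictionary keys, the function name, and the token counts are hypothetical and are not official API identifiers or the calculator's own code.

```python
# Prices in USD per million tokens as quoted on this page: (input, output).
# The keys are illustrative labels, not official API model IDs.
PRICES = {
    "gpt-5-nano": (0.05, 0.40),
    "gpt-5": (1.25, 10.00),
}


def call_cost(model, input_tokens, visible_output_tokens, reasoning_tokens=0):
    """Cost of one call in USD, billing reasoning tokens at the output rate."""
    input_price, output_price = PRICES[model]
    billed_output_tokens = visible_output_tokens + reasoning_tokens
    return (input_tokens * input_price
            + billed_output_tokens * output_price) / 1_000_000


# Hypothetical call: 2,000 input tokens and 300 visible output tokens.
without_reasoning = call_cost("gpt-5", 2_000, 300)
# Same call with 1,500 hidden reasoning tokens (five times the visible output).
with_reasoning = call_cost("gpt-5", 2_000, 300, reasoning_tokens=1_500)

print(f"without reasoning: ${without_reasoning:.4f}")
print(f"with reasoning:    ${with_reasoning:.4f}")
```

Under these assumed token counts the call rises from $0.0055 to $0.0205, which is why leaving the reasoning field at zero undercounts the bill. Changing a price means editing one row of the table.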

How much does an AI agent cost to operate

AI agent development cost is one number. AI agent operating cost is a different number, and it is the one that shows up on the invoice every month. A single API call costs pennies. An agent that runs a five turn conversation, calls three tools per turn, carries forward context across turns, and occasionally needs a reasoning pass can run two orders of magnitude higher per session. The cost shape is closer to a compounding loop than a flat per request number, which is why a simple price per token estimate underforecasts the bill by a wide margin.

This calculator treats single calls, pipelines, agent loops, and multi agent crews as the same problem with different multipliers. For agent workloads it asks for tool calls per turn, turns per session, and the percent of context that carries forward turn over turn, then computes cost per session and cost per user per month alongside the headline monthly total. Cost per task for an AI agent is the same calculation viewed at a different unit of work.
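
The three agent figures can be sketched as one calculation viewed at different units of work. This is a simplified illustration under stated assumptions, not the calculator's implementation: it uses a flat, hypothetical cost of $0.01 per model call, ignores context growth, and counts turns times tool calls per turn as the number of model calls. The names `AgentCost` and `agent_cost` are invented for the example.

```python
from typing import NamedTuple


class AgentCost(NamedTuple):
    per_session: float
    per_user_month: float
    monthly_total: float


def agent_cost(sessions_per_month, turns_per_session, tool_calls_per_turn,
               cost_per_call, retry_rate=0.0, sessions_per_user=1.0):
    """Cost per session, per user per month, and per month, all in USD."""
    calls_per_session = turns_per_session * tool_calls_per_turn
    # Assumption: a retry repeats a full call, so it scales cost linearly.
    per_session = calls_per_session * cost_per_call * (1 + retry_rate)
    return AgentCost(
        per_session=per_session,
        per_user_month=per_session * sessions_per_user,
        monthly_total=per_session * sessions_per_month,
    )


# 10,000 sessions, 4 turns, 3 tool calls per turn, 8% retries,
# 5 sessions per user, and a hypothetical flat $0.01 per model call.
cost = agent_cost(10_000, 4, 3, cost_per_call=0.01,
                  retry_rate=0.08, sessions_per_user=5)
print(f"per session:    ${cost.per_session:.4f}")
print(f"per user/month: ${cost.per_user_month:.3f}")
print(f"monthly total:  ${cost.monthly_total:,.2f}")
```

With these inputs the session costs about $0.13, the user about $0.65 a month, and the workload about $1,296 a month. A fuller model would replace the flat per call cost with token counts that grow turn over turn.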

Why agent loops cost more than pipelines

A pipeline runs a fixed graph: four nodes in, four model calls out, predictable cost. An agent loop is open ended: the model decides how many tool calls to make before answering, and the context window grows each turn because earlier messages stay in scope. A workflow that averages three tool calls per turn and four turns per session makes about twelve model calls per user session, before any retries or critique passes.

The dominant lever is usually context growth, not model choice. Carrying forward eighty percent of context across four turns roughly doubles the input token bill versus a fresh context per turn. The calculator surfaces this explicitly so the trade off between memory quality and bill size is visible.
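
The doubling can be reproduced with a simple recurrence. The sketch below assumes that each turn sends a fixed base of new input plus the carry forward fraction of the previous turn's input; that model, the function names, and the example fractions are assumptions made for illustration, and a real agent that appends tool results may grow faster.

```python
def input_tokens_per_turn(base_tokens, carry_forward, turns):
    """Input tokens sent on each turn under a simple carry forward model."""
    per_turn = []
    previous = 0.0
    for _ in range(turns):
        # New input for this turn plus the carried share of the last turn's input.
        current = base_tokens + carry_forward * previous
        per_turn.append(current)
        previous = current
    return per_turn


def growth_factor(carry_forward, turns):
    """Total input tokens relative to a fresh context on every turn."""
    per_turn = input_tokens_per_turn(1.0, carry_forward, turns)
    return sum(per_turn) / turns


for carry in (0.0, 0.2, 0.8):
    print(f"carry forward {carry:.0%}: "
          f"{growth_factor(carry, 4):.2f}x input tokens over 4 turns")
```

Under this model an 80 percent carry forward over four turns gives about 2.05 times the input tokens of a fresh context, while 20 percent gives about 1.17 times.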

What fan out is in an LLM pipeline

Fan out is when one user request triggers multiple model calls. A research agent that checks five sources, critiques each, and summarizes them creates ten or more model calls before the user sees a single answer. A multi agent system can fan out by an order of magnitude before any human work is done.

Fan out is the silent cost driver. The visible UI shows one button click. The trace shows a tree. The bill follows the tree, not the button.
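
One way to see the tree behind the button is to model the trace as nested steps and count the model calls. The `Step` class, the step names, and the assumption of one model call per step are hypothetical and chosen only to mirror the research agent described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    name: str
    model_calls: int = 1  # assumption: each step makes one model call
    children: List["Step"] = field(default_factory=list)


def total_calls(step: Step) -> int:
    """Count model calls in a step and everything it fans out to."""
    return step.model_calls + sum(total_calls(child) for child in step.children)


# Research agent: check five sources, critique each, then summarize once.
sources = [
    Step(f"check source {i}", children=[Step(f"critique source {i}")])
    for i in range(1, 6)
]
# The user request itself is the button click and makes no model call.
request = Step("user request", model_calls=0,
               children=sources + [Step("summarize")])

print(total_calls(request))  # 11 model calls behind one click
```

Multiplying that count by the per call cost gives the per request figure, which is why the forecast needs the trace rather than the click count.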

Caching savings that are real

Prompt caching reduces cost when the cached prefix is large and reused across many requests. For internal tools with stable system prompts and documents, cache hit rates between 30 and 70 percent are realistic, and at the upper end of that range they can cut input cost by half or more.

Caching helps less for unique customer conversations, exploratory queries, or workflows where most tokens are generated dynamically. Measure the actual hit rate from production traces before counting on the savings in a forecast.
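
The relationship between hit rate and savings is simple arithmetic and can be sketched as follows. The sketch assumes the hit rate is the share of input tokens served from cache and that cached tokens are billed at 10 percent of the base rate; both are assumptions, the function names are invented, and any surcharge a provider applies when writing to the cache is ignored.

```python
def input_cost_multiplier(cache_hit_rate, cached_price_ratio=0.10):
    """Share of the list input price still paid after caching."""
    # Uncached tokens pay full price; cached tokens pay the discounted ratio.
    return (1 - cache_hit_rate) + cache_hit_rate * cached_price_ratio


def input_savings(cache_hit_rate, cached_price_ratio=0.10):
    """Fraction of input cost removed by caching."""
    return 1 - input_cost_multiplier(cache_hit_rate, cached_price_ratio)


for hit_rate in (0.30, 0.50, 0.70):
    print(f"hit rate {hit_rate:.0%}: "
          f"input cost falls by {input_savings(hit_rate):.0%}")
```

Under these assumptions hit rates of 30, 50, and 70 percent reduce input cost by about 27, 45, and 63 percent. Output tokens are unaffected, so the effect on the total bill is smaller when output dominates.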

A real world RAG pipeline cost example

A typical RAG QA pipeline runs four nodes per request: embed the query, retrieve top documents, rerank, and generate the answer. At 100,000 monthly requests with a $0.15 per million input token model, a 25 percent cache hit rate, an 8 percent retry rate, and a 6,000 token average context, the generation node reads about 600 million input tokens a month, which comes to roughly $75 to $100 at list price before output tokens, embedding, and reranking charges.

Add reranking and the cost rises by 20 to 40 percent. Add a critique pass for high stakes answers and the cost rises again. Use the calculator to build the actual cost from your own node graph rather than from a vendor headline price.

Why pipeline costs differ from single call costs

Production systems rarely make one model call. A RAG answer may embed a query, retrieve documents, rerank passages, generate an answer, run a critique pass, and write an audit event. Each stage has its own multiplier and its own failure rate.

The calculator separates total monthly cost, cost per request, and the dominant cost driver so you can see which node to optimize first. Most of the time the answer is the generation node, but in retrieval heavy pipelines reranking or embedding can dominate.
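
Finding the dominant driver amounts to comparing per node costs. The sketch below uses hypothetical per request costs for each node and an invented `cost_breakdown` helper; none of the numbers come from the calculator.

```python
def cost_breakdown(node_costs):
    """Return total cost, each node's share of it, and the most expensive node."""
    total = sum(node_costs.values())
    shares = {name: cost / total for name, cost in node_costs.items()}
    driver = max(node_costs, key=node_costs.get)
    return total, shares, driver


# Hypothetical cost per request in USD for each node of a RAG pipeline.
generation_heavy = {"embed": 0.00002, "rerank": 0.0020,
                    "generate": 0.0180, "critique": 0.0060}
# Hypothetical retrieval heavy variant where reranking many passages dominates.
retrieval_heavy = {"embed": 0.00050, "rerank": 0.0200,
                   "generate": 0.0050, "critique": 0.0010}

for label, costs in (("generation heavy", generation_heavy),
                     ("retrieval heavy", retrieval_heavy)):
    total, shares, driver = cost_breakdown(costs)
    print(f"{label}: ${total:.5f} per request, "
          f"top driver {driver} ({shares[driver]:.0%})")
```

In the first made-up pipeline generation is the largest share, and in the second reranking is, which is the distinction the cost driver readout is meant to expose.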

Assumptions and methodology

This tool uses transparent browser-side calculations and curated assumptions rather than LLM-generated recommendations. Outputs are planning estimates. They should be validated against provider pricing, production traces, engineering quotes, or domain review before money, compliance, safety, or hiring decisions are made.

Numerical defaults are dated and surfaced on the page. The methodology favours explicit assumptions over false precision: every estimate is meant to expose the variable that drives the result, not to pretend that early planning data is exact.

Turn the result into an implementation plan

Bring the scenario to a strategy call and I will pressure-test the workflow, assumptions, failure modes, and delivery path.

Frequently asked questions

How do I calculate AI agent cost per month?
Multiply monthly active sessions by average turns per session, then by average tool calls per turn, then by token cost per call. Add a retry multiplier, subtract a cache discount on input tokens, and add reasoning tokens at the output rate. The calculator does this end to end and also gives you cost per session and cost per user per month, which are the numbers a CFO actually wants.
What does a typical AI agent cost per user per month?
For a Claude Haiku 4.5 or GPT-5 mini agent answering five sessions a month with four turns and three tool calls per turn, expect roughly $0.50 to $3.00 per user per month at moderate token sizes. Heavier reasoning models, long carry forward context, or unbounded tool loops can push that into $10 to $50 per user. The calculator surfaces both numbers side by side.
How much does the OpenAI API cost in production?
OpenAI API cost depends on model and workload shape. GPT-5 nano is $0.05 per million input tokens and $0.40 per million output. GPT-5 is $1.25 in and $10 out. GPT-5.5 is $5 in and $30 out. A production chatbot serving 100,000 monthly requests on GPT-5 mini with moderate prompts typically lands between $40 and $300 per month before reasoning tokens. The calculator multiplies these prices by your real workload.
How much does the ChatGPT API cost per token?
The ChatGPT API is the same OpenAI API. Per token cost today ranges from $0.05 per million input tokens on GPT-5 nano to $30 per million output on GPT-5.5. Cached input tokens are billed at roughly 10 percent of the base rate, and Batch API workloads are billed at 50 percent. Plug your call volume into the calculator and it converts these per token numbers into a real monthly forecast.
How much does it cost to build an AI agent?
Building an AI agent costs from a few thousand dollars for a single tool MVP to $50,000 or more for a multi agent system with evals, guardrails, and observability. Operating that agent is a separate line item, and that is what this calculator forecasts. For a build cost estimate use the agentic AI MVP cost tool linked in the related section; this calculator answers the ongoing monthly bill question.
Why does fan out matter so much in an LLM pipeline?
Fan out turns one request into many calls. A research agent that checks five sources, critiques each, and summarizes them may create ten or more model calls before the user sees one answer. Multi agent crews can fan out by an order of magnitude, which is why a per token estimate without a fan out factor underforecasts the bill.
Does prompt caching really save 90% on LLM cost?
Prompt caching can save a large fraction when the cached prefix is large and reused, often 40 to 70 percent of input cost for internal tools. It saves little for highly unique prompts or agent loops where most tokens are generated dynamically. The headline 90 percent is the discount on the cached tokens themselves, so it rarely matches the savings seen in a real production trace.
How do reasoning tokens affect agent cost?
Reasoning tokens from GPT-5 thinking, Claude extended thinking, and Gemini thinking are billed at the output rate and are usually invisible in the response shown to the user. Uncapped, they can be three to ten times the visible output. The calculator has a separate reasoning tokens field so the forecast does not silently undercount them.