Direct answer
Use this calculator when one product action triggers multiple LLM calls, retrieval steps, retries, or fan-out branches.
RAG QA pipeline
Input: 100,000 monthly requests, four pipeline nodes, 8% retry rate, 25% prompt-cache hit rate.
Output: total monthly cost, cost per request, and the dominant cost driver, reported separately.
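The arithmetic behind an example like this can be sketched in a few lines. Everything below the input constants is an assumption for illustration: the per-node token counts, the per-million-token prices, and the 90% cached-input discount are made-up numbers, not real provider rates or the calculator's actual defaults.

```python
# Illustrative sketch of the RAG QA example above. Per-node token counts and
# prices are invented assumptions, not real provider pricing.

MONTHLY_REQUESTS = 100_000
RETRY_RATE = 0.08        # 8% of node calls are retried once
CACHE_HIT_RATE = 0.25    # 25% of input tokens hit the prompt cache
CACHE_DISCOUNT = 0.9     # assumed: cached input tokens cost 90% less

# (name, input_tokens, output_tokens, $/1M input, $/1M output) -- all assumed
NODES = [
    ("embed",    500,  0,   0.10,  0.00),
    ("retrieve", 0,    0,   0.00,  0.00),  # vector DB lookup, no LLM cost here
    ("generate", 3000, 500, 3.00, 15.00),
    ("critique", 1500, 200, 3.00, 15.00),
]

def node_cost(inp, out, p_in, p_out):
    """Cost of one node per request, with cache discount and retry overhead."""
    cached = inp * CACHE_HIT_RATE
    uncached = inp - cached
    input_cost = (uncached * p_in + cached * p_in * (1 - CACHE_DISCOUNT)) / 1e6
    output_cost = out * p_out / 1e6
    return (input_cost + output_cost) * (1 + RETRY_RATE)

per_request = sum(node_cost(i, o, pi, po) for _, i, o, pi, po in NODES)
monthly = per_request * MONTHLY_REQUESTS
driver = max(NODES, key=lambda n: node_cost(*n[1:]))[0]

print(f"cost/request:  ${per_request:.5f}")
print(f"monthly:       ${monthly:,.2f}")
print(f"dominant node: {driver}")
```

With these assumed numbers the generation node dominates, which is the kind of "dominant cost driver" finding the calculator surfaces.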
How to use this tool
1. Pick a template.
2. Set monthly request volume.
3. Adjust retry, cache, and batch settings.
4. Review cost drivers and optimisation suggestions.
Why pipeline costs differ from single-call costs
Production systems rarely make one model call. A RAG answer may embed a query, retrieve documents, rerank passages, generate an answer, run a critique pass, and write an audit event. Each stage has its own multiplier.
Fan-out is the silent cost driver. If one user request creates five branch calls and two retries, the bill follows the graph rather than the visible UI action.
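The fan-out arithmetic is simple but easy to underestimate. A minimal sketch, assuming retries add a proportional overhead on top of every branch call (the function name and numbers are illustrative, not part of the tool):

```python
# Hypothetical sketch: expected model calls per visible user request once
# fan-out, branch probability, and retries are multiplied through the graph.

def expected_calls(fan_out: int, branch_prob: float, retry_rate: float) -> float:
    """Expected calls for one node: branches that actually run, plus retries."""
    return fan_out * branch_prob * (1 + retry_rate)

# One visible request -> 5 always-run branch calls plus 2 critique calls,
# each subject to an 8% retry rate.
branches = expected_calls(fan_out=5, branch_prob=1.0, retry_rate=0.08)
critiques = expected_calls(fan_out=2, branch_prob=1.0, retry_rate=0.08)
print(branches + critiques)  # 7.56 expected calls per visible request
```

One click in the UI becomes roughly seven and a half model calls here, which is why the bill follows the graph.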
Optimisations that actually save money
Caching helps when prompts repeat. Batching helps asynchronous workloads. Routing saves money when easy cases can use smaller models. The calculator shows savings as a directional estimate, not a guarantee, because cache hit rates and routing probabilities must be measured in production.
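As a rough model, the three optimisations can be treated as independent multiplicative discounts, which is itself an assumption; in practice cache hits, batch eligibility, and routing overlap. All rates and discounts below are placeholders to be replaced with measured values:

```python
# Directional-savings sketch (assumed numbers, not provider pricing): cache
# hit rate, batch share, and routing share applied as independent discounts.

def effective_cost(base: float, cache_hit: float, cache_discount: float,
                   batch_share: float, batch_discount: float,
                   routed_share: float, small_model_ratio: float) -> float:
    cost = base
    cost *= 1 - cache_hit * cache_discount            # cached prefix tokens
    cost *= 1 - batch_share * batch_discount          # async traffic batched
    cost *= 1 - routed_share * (1 - small_model_ratio)  # easy cases routed down
    return cost

base = 1000.0  # $/month before optimisation (assumption)
after = effective_cost(base, cache_hit=0.25, cache_discount=0.9,
                       batch_share=0.5, batch_discount=0.5,
                       routed_share=0.4, small_model_ratio=0.2)
print(f"${after:.2f}")  # → $395.25, a directional estimate only
```

Because each factor is a measured-in-production quantity, the output should be read as a direction of travel, not a quote.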
Assumptions and methodology
This tool uses transparent browser-side calculations and curated assumptions rather than LLM-generated recommendations. Outputs are planning estimates. They should be validated against provider pricing, production traces, engineering quotes, or domain review before money, compliance, safety, or hiring decisions are made.
Numerical defaults are dated and surfaced on the page. The methodology favours explicit assumptions over false precision: every estimate is meant to expose the variable that drives the result, not to pretend that early planning data is exact.
Turn the result into an implementation plan
Bring the scenario to a strategy call and I will pressure-test the workflow, assumptions, failure modes, and delivery path.
Book a strategy call
Frequently asked questions
- How is pipeline cost different from per-token cost?
- Per-token cost prices one call. Pipeline cost multiplies every node, branch probability, fan-out, retry, and optimisation. A cheap model can become expensive when a workflow calls it many times per user request.
- Why does fan-out matter so much?
- Fan-out turns one request into many calls. A research agent that checks five sources, critiques each, and summarizes them may create ten or more model calls before the user sees one answer.
- Does prompt caching really save 90%?
- Prompt caching can save a lot when the cached prefix is large and reused. It does not save much for highly unique prompts or workflows where most tokens are generated dynamically.
- When does the Batch API help?
- Batch APIs help asynchronous workloads such as backfills, offline enrichment, eval runs, and nightly summarization. They are less useful when the user expects an immediate interactive response.
- Can I model conditional branching?
- Yes. Use the branch probability field to represent a router. A compliance review branch that runs on 20% of requests should use a 0.2 probability rather than being counted on every request.
- What's a realistic cache-hit rate?
- For internal tools with repeated system prompts and documents, 30% to 70% can be realistic. For unique customer conversations, cache hit rates may be far lower and should be measured from traces.
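The branch-probability idea from the FAQ above reduces to weighting each node's per-call cost by how often it actually runs. A minimal sketch with invented per-call costs (the 0.2 probability mirrors the 20% compliance-review example):

```python
# Sketch: expected pipeline cost per request with conditional branches.
# Per-call costs are invented for illustration.
import math

def pipeline_cost_per_request(nodes):
    """nodes: list of (cost_per_call, branch_probability, retry_rate)."""
    return sum(cost * prob * (1 + retries) for cost, prob, retries in nodes)

nodes = [
    (0.002, 1.0, 0.08),  # main answer call, runs on every request
    (0.004, 0.2, 0.08),  # compliance review, runs on 20% of requests
]
print(pipeline_cost_per_request(nodes))
```

Counting the compliance branch on every request instead of 20% of them would overstate its contribution fivefold, which is exactly the error the probability field exists to prevent.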