Direct answer
Retrieval augmented generation is a pattern where a language model answers questions by retrieving documents from a vector index first, then generating an answer grounded in those documents. The cost of a RAG pipeline has five layers: one time indexing, ongoing storage, query side embeddings, retrieval and reranking, and final answer generation.
Internal docs assistant
Input: 1GB corpus, 500 token chunks, 50 token overlap, pgvector, 100,000 monthly queries, top k 6.
Output: one time index cost, monthly query cost, storage cost, and a 10x scale forecast.
How to use this tool
1. Enter corpus size and chunking.
2. Choose vector DB and embedding model.
3. Set query volume and generation tokens.
4. Review index, storage, query, and 10x scale costs.
What is retrieval augmented generation
Retrieval augmented generation is a pattern that answers a user query by retrieving relevant documents from a vector index first and then asking a language model to generate an answer grounded in those documents. It exists because language models cannot know your private data and because keeping the model grounded in real sources reduces hallucination compared to free form generation.
A practical RAG pipeline includes chunking, embedding, indexing, query embedding, retrieval, optional reranking, and final generation. Each step costs money and adds latency, so the system should only retrieve when retrieval improves the answer.
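The chunking step can be sketched as a sliding window over tokens. This is a minimal illustration, assuming the text is already tokenized into a list. The helper name `chunk_tokens` is hypothetical, and the defaults reuse the 500 token chunks and 50 token overlap from the example scenario above. It is not the code this tool runs.

```python
from typing import Iterator, List, Sequence


def chunk_tokens(
    tokens: Sequence[str], chunk_size: int = 500, overlap: int = 50
) -> Iterator[List[str]]:
    """Yield overlapping windows of tokens (hypothetical helper)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # new tokens consumed by each chunk
    for start in range(0, len(tokens), stride):
        yield list(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text


# Stand-in tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = list(chunk_tokens(tokens))
```

Each chunk after the first repeats the last 50 tokens of the previous one, which is why overlap raises both the vector count and the number of tokens that get embedded.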
How RAG pipeline costs add up
One time costs include embedding the entire corpus and writing it to a vector database. Ongoing storage costs follow the vector count and the chosen vector database tier. Per query costs include embedding the user query, retrieving top documents, optionally reranking them, and generating the final answer with the retrieved context concatenated to the prompt.
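As a rough sketch of how storage follows the vector count, the snippet below estimates vectors and raw vector bytes for the example scenario. It assumes the corpus holds about 250 million tokens, float32 values, and a 1536 dimension embedding model. The dimension count and both function names are assumptions for illustration, and the result excludes index overhead, metadata, and replicas.

```python
import math


def vector_count(corpus_tokens: int, chunk_size: int, overlap: int) -> int:
    """Approximate number of chunks, and therefore vectors, for one corpus."""
    stride = chunk_size - overlap  # new tokens covered by each chunk
    return math.ceil(corpus_tokens / stride)


def raw_vector_bytes(vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 by default)."""
    return vectors * dimensions * bytes_per_value


# Assumed inputs: 250M tokens, 500 token chunks, 50 token overlap, 1536 dims.
vectors = vector_count(250_000_000, chunk_size=500, overlap=50)
gigabytes = raw_vector_bytes(vectors, dimensions=1536) / 1e9
```

Under these assumptions the corpus produces a little over half a million vectors and roughly 3.4 GB of raw vector data, so the stored vectors can be larger than the source text itself.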
The dominant cost is usually generation tokens, because the prompt now includes retrieved context that can run to thousands of tokens. Top k and chunk size multiply the generation input token count, while reranking depth and chunk overlap add tokens at the reranking and indexing stages, and all of them raise the bill.
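The per query arithmetic can be sketched as follows. Top k and chunk size come from the example scenario. Every other number is an assumption: the question, instruction, and answer lengths, and both generation prices are placeholders rather than quoted provider rates. The vector database lookup is treated as part of the storage tier and is not priced per query here.

```python
from dataclasses import dataclass


@dataclass
class QueryAssumptions:
    top_k: int = 6                    # from the example scenario
    chunk_tokens: int = 500           # from the example scenario
    question_tokens: int = 50         # assumed user question length
    instruction_tokens: int = 200     # assumed system prompt length
    output_tokens: int = 300          # assumed answer length
    embed_price_per_m: float = 0.02   # USD per million tokens, low end of the range
    input_price_per_m: float = 1.00   # hypothetical generation input price
    output_price_per_m: float = 4.00  # hypothetical generation output price


def per_query_cost(a: QueryAssumptions) -> dict:
    """Split one query's cost into its token-priced parts (USD)."""
    context_tokens = a.top_k * a.chunk_tokens
    input_tokens = a.instruction_tokens + a.question_tokens + context_tokens
    costs = {
        "query_embedding": a.question_tokens * a.embed_price_per_m / 1e6,
        "generation_input": input_tokens * a.input_price_per_m / 1e6,
        "generation_output": a.output_tokens * a.output_price_per_m / 1e6,
    }
    costs["total"] = sum(costs.values())
    return costs


costs = per_query_cost(QueryAssumptions())
monthly = costs["total"] * 100_000  # example scenario query volume
```

With these placeholder inputs, retrieved context makes up 3,000 of the 3,250 generation input tokens, and the query embedding is a rounding error next to generation. Swap in real prices and measured token counts before relying on the totals.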
RAG vs fine tuning cost compared
RAG is usually cheaper when knowledge changes often, citations matter, and the model should not memorize private data. Fine tuning is usually cheaper when style or format must be enforced across millions of calls and the data is stable. A 1GB corpus refreshed weekly costs almost nothing to re embed but is expensive to keep re tuning. A consistent output format across high volume calls is cheaper to tune in than to enforce with retrieval.
Most production systems end up using both: RAG for knowledge freshness and citations, and fine tuning or careful prompting for format and tone. The calculator helps you separate the two costs so you can pick the right tool for each goal.
Embedding provider cost comparison
Embedding costs in May 2026 range from roughly $0.02 to $0.13 per million tokens depending on provider and model size. OpenAI, Cohere, Voyage, and self hosted models each have different cost and quality tradeoffs. A 1GB text corpus contains roughly 250 million tokens at about four bytes per token, so a single full embedding pass typically costs $5 to $35.
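The one time embedding estimate can be reproduced with a few lines. This sketch assumes about four bytes per token, a common rule of thumb for English text that varies by tokenizer and language. The function name `embedding_pass_cost` is illustrative.

```python
def embedding_pass_cost(
    corpus_bytes: float,
    price_per_million: float,
    bytes_per_token: float = 4.0,  # assumed rule of thumb, not a measured value
    chunk_size: int = 500,
    overlap: int = 0,
) -> float:
    """Estimated USD cost to embed the whole corpus once."""
    corpus_tokens = corpus_bytes / bytes_per_token
    # Overlapping chunks embed some tokens twice, which inflates the total.
    embedded_tokens = corpus_tokens * chunk_size / (chunk_size - overlap)
    return embedded_tokens / 1e6 * price_per_million


low = embedding_pass_cost(1e9, 0.02)
high = embedding_pass_cost(1e9, 0.13)
high_with_overlap = embedding_pass_cost(1e9, 0.13, overlap=50)
```

Without overlap this gives about $5 at the low price and about $32.50 at the high price. A 50 token overlap on 500 token chunks raises either figure by about 11 percent.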
Self hosted embeddings remove per call cost but add GPU and operational cost. They become cheaper than hosted models when monthly embedding volume passes roughly 500 million tokens, a break even point that most teams reach only at scale.
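The break even logic is a single division: fixed monthly self hosting cost over the hosted price per token. In the sketch below the $65 fixed cost is a placeholder chosen only to show how a figure near 500 million tokens can arise at the top hosted price. It is not a quoted GPU or staffing cost, and the function name is hypothetical.

```python
def break_even_tokens_per_month(
    fixed_monthly_cost: float, hosted_price_per_million: float
) -> float:
    """Monthly token volume above which self hosting costs less than hosted."""
    if hosted_price_per_million <= 0:
        raise ValueError("hosted price must be positive")
    return fixed_monthly_cost / hosted_price_per_million * 1e6


# Placeholder fixed cost (GPU plus operations), compared at both ends of the
# hosted price range quoted above.
at_high_price = break_even_tokens_per_month(65.0, 0.13)
at_low_price = break_even_tokens_per_month(65.0, 0.02)
```

The break even volume is very sensitive to the hosted price: the same fixed cost needs 6.5 times more volume to pay off against a $0.02 model than against a $0.13 model.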
Who can implement governance for retrieval augmented generation
Governance for RAG sits with the team that owns the data, not only the team that builds the model. A working pattern is to assign a named owner per corpus who controls access rules, retention, freshness checks, and removal requests. The team that builds the assistant is responsible for citation surfaces, audit logging, and refusal behaviour when retrieval returns weak or sensitive results.
If no one owns a corpus, treat that as a blocker. RAG without a data owner is a compliance incident waiting to happen, regardless of how strong the retrieval quality is.
Assumptions and methodology
This tool uses transparent browser-side calculations and curated assumptions rather than LLM-generated recommendations. Outputs are planning estimates. They should be validated against provider pricing, production traces, engineering quotes, or domain review before money, compliance, safety, or hiring decisions are made.
Numerical defaults are dated and surfaced on the page. The methodology favours explicit assumptions over false precision: every estimate is meant to expose the variable that drives the result, not to pretend that early planning data is exact.
Turn the result into an implementation plan
Bring the scenario to a strategy call and I will pressure-test the workflow, assumptions, failure modes, and delivery path.
Frequently asked questions
- How much does a RAG pipeline cost?
- A small internal RAG system with a 1GB corpus and 100,000 monthly queries usually costs $100 to $500 per month including embeddings, storage, retrieval, and generation. A larger production RAG with reranking, multi tenant isolation, and audit logging can pass $5,000 per month. The dominant cost is usually generation tokens because retrieved context inflates the prompt.
- What is the difference between RAG and fine tuning cost?
- RAG is usually cheaper when knowledge changes often or citations matter. Fine tuning is usually cheaper when style or format must be enforced across millions of calls and the data is stable. Most production systems use both. The calculator separates RAG cost so you can compare it side by side against a fine tuning quote.
- How is RAG cost different from a single LLM call?
- RAG adds indexing, vector storage, retrieval, optional reranking, and a larger generation prompt. The user sees one answer, but the system performs several paid operations before generating it. The per query cost of a RAG system is often three to ten times higher than a single ungrounded model call.
- What is a sensible chunk size for RAG?
- A common starting point is 400 to 800 tokens with 10 to 20 percent overlap. Smaller chunks improve precision but increase vector count, storage, and retrieval overhead. Larger chunks reduce vector count but can dilute relevance. Tune the chunk size against actual eval scores rather than picking a number from a blog post.
- Should I use small or large embedding models?
- Use small embeddings for cost sensitive or broad retrieval. Use larger embeddings when retrieval quality is the bottleneck and the cost is justified by accuracy or reduced human review. The cost gap between a small and large embedding model is usually less than the cost gap between weak retrieval and strong retrieval at scale.
- When is pgvector cheaper than a managed vector database?
- pgvector is often cheaper when your team already operates Postgres and scale is moderate, typically under 50 million vectors. Managed vector databases justify their cost when you need dedicated scaling, hybrid search, multi tenant isolation, or operational tooling that a generic Postgres deployment does not provide.
- How much does reranking add to RAG cost?
- Reranking adds another model call per query or per candidate set, typically adding 20 to 40 percent to the cost of each query. It can materially improve answer quality, but high volume systems should test whether the quality gain offsets the latency and cost. Measure rerank impact on actual evals before making it permanent.
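The reranking tradeoff in the last answer can be made concrete by pricing each additional correct answer. This is a sketch under stated assumptions: the base monthly cost, the 30 percent uplift, and both accuracy figures are hypothetical inputs that you would replace with your own bill and eval results.

```python
import math


def rerank_tradeoff(
    base_monthly_cost: float,
    uplift: float,
    baseline_accuracy: float,
    reranked_accuracy: float,
    monthly_queries: int,
) -> dict:
    """Extra monthly spend on reranking and what each extra correct answer costs."""
    extra_cost = base_monthly_cost * uplift
    extra_correct = (reranked_accuracy - baseline_accuracy) * monthly_queries
    # No measured gain means reranking buys nothing at any price.
    per_correct = extra_cost / extra_correct if extra_correct > 0 else math.inf
    return {
        "extra_cost": extra_cost,
        "extra_correct": extra_correct,
        "cost_per_extra_correct": per_correct,
    }


# Hypothetical inputs: $400 base bill, 30% uplift, eval accuracy 0.80 -> 0.85.
result = rerank_tradeoff(400.0, 0.30, 0.80, 0.85, 100_000)
no_gain = rerank_tradeoff(400.0, 0.30, 0.80, 0.80, 100_000)
```

If the cost per extra correct answer is lower than what a wrong answer costs you in support time or human review, reranking is likely worth keeping. Latency is not modelled here and should be measured separately.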