Direct answer
Use this calculator to estimate RAG cost across indexing, vector storage, retrieval, reranking, and generation, plus a 10x scale forecast.
Internal docs assistant
Input: 1GB corpus, 500-token chunks, 50-token overlap, pgvector, 100,000 monthly queries, top-k 6.
Output: one-time index cost, monthly query cost, storage cost, and a 10x scale forecast.
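The one-time index cost for this scenario can be sketched in a few lines. All numbers here are assumptions: roughly 4 characters per token for plain text, and a hypothetical embedding price of $0.02 per 1M tokens; substitute current provider pricing before relying on the result.

```python
EMBED_PRICE_PER_MTOK = 0.02  # hypothetical $/1M embedding tokens; check current provider pricing

corpus_tokens = 1_000_000_000 // 4  # 1GB of text at ~4 characters per token -> ~250M tokens
chunk_tokens, overlap_tokens = 500, 50
stride = chunk_tokens - overlap_tokens     # new tokens consumed per chunk
n_chunks = -(-corpus_tokens // stride)     # ceiling division -> number of vectors
embedded_tokens = n_chunks * chunk_tokens  # overlap means some tokens are embedded twice

index_cost = embedded_tokens / 1e6 * EMBED_PRICE_PER_MTOK
print(f"{n_chunks:,} chunks, ~${index_cost:.2f} one-time index cost")
```

Under these assumptions the corpus yields roughly 556k vectors and a single-digit-dollar index cost; the recurring per-query side usually dominates the total.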
How to use this tool
1. Enter corpus size and chunking.
2. Choose a vector DB and embedding model.
3. Set query volume and generation tokens.
4. Review index, storage, query, and 10x scale costs.
RAG cost has three layers
RAG cost is not a single model call. You pay once to embed and index the corpus, monthly to store vectors, and on every query to embed the question, retrieve and rerank documents, and generate the final answer.
Chunk size, overlap, top-k, and reranking multiply cost. A cheap embedding model can still become expensive if the corpus is large and re-indexed often.
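The multiplier effect can be sketched as a per-query cost function. Every price below is a hypothetical placeholder, not a real provider rate, and the 50-token query length is an assumption:

```python
def monthly_query_cost(queries, top_k, chunk_tokens, output_tokens,
                       in_price, out_price, embed_price, query_tokens=50):
    """Per-query cost = query embedding + retrieved context in the prompt + generation.
    All prices are hypothetical $/1M tokens; substitute current provider pricing."""
    embed = query_tokens / 1e6 * embed_price
    context = top_k * chunk_tokens / 1e6 * in_price  # every retrieved chunk is billed as input
    generate = output_tokens / 1e6 * out_price
    return queries * (embed + context + generate)

# 100k queries/month, top-k 6, 500-token chunks, 300-token answers (assumed prices)
base = monthly_query_cost(100_000, 6, 500, 300,
                          in_price=0.15, out_price=0.60, embed_price=0.02)
# Doubling top-k roughly doubles the context term, the largest line item here
wide = monthly_query_cost(100_000, 12, 500, 300,
                          in_price=0.15, out_price=0.60, embed_price=0.02)
```

In this sketch the retrieved context dominates the per-query bill, which is exactly why top-k and chunk size behave as cost multipliers rather than rounding errors.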
When RAG is cheaper than fine-tuning
RAG is usually cheaper when knowledge changes often, citations matter, and the model should not memorize private data. Fine-tuning can make sense for style, formats, or repeated behaviours, but it is rarely the first answer for knowledge freshness.
Assumptions and methodology
This tool uses transparent browser-side calculations and curated assumptions rather than LLM-generated recommendations. Outputs are planning estimates. They should be validated against provider pricing, production traces, engineering quotes, or domain review before money, compliance, safety, or hiring decisions are made.
Numerical defaults are dated and surfaced on the page. The methodology favours explicit assumptions over false precision: every estimate is meant to expose the variable that drives the result, not to pretend that early planning data is exact.
Turn the result into an implementation plan
Bring the scenario to a strategy call and I will pressure-test the workflow, assumptions, failure modes, and delivery path.
Book a strategy call
Frequently asked questions
- How is RAG cost different from a single LLM call?
- RAG adds indexing, vector storage, retrieval, optional reranking, and a larger generation prompt. The user sees one answer, but the system performs several paid operations before generating it.
- What's a sensible chunk size?
- A common starting point is 400 to 800 tokens with 10% to 20% overlap. Smaller chunks improve precision but increase vector count and retrieval overhead.
- Should I use small or large embeddings?
- Use small embeddings for cost-sensitive or broad retrieval. Use larger embeddings when retrieval quality is the bottleneck and the cost is justified by accuracy or reduced human review.
- When is pgvector cheaper?
- pgvector is often cheaper when your team already operates Postgres and scale is moderate. Managed vector databases can justify cost when you need dedicated scaling, filtering, and operational tooling.
- How much does reranking add?
- Reranking adds another model call per query or per candidate set. It can improve answer quality, but high-volume systems should test whether the quality gain offsets latency and cost.
- Is caching realistic for RAG?
- Caching is realistic for repeated questions, stable corpora, and internal knowledge bases. It is weaker for highly personalised, fresh, or exploratory queries where retrieval context changes often.
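The chunk-size trade-off above can be made concrete. A minimal sketch, assuming a ~250M-token corpus (about 1GB of text at ~4 characters per token) and the 400 to 800 token range with overlap mentioned in the FAQ:

```python
def vector_count(corpus_tokens, chunk_tokens, overlap_tokens):
    """Number of vectors produced by fixed-size chunking with overlap."""
    stride = chunk_tokens - overlap_tokens
    return -(-corpus_tokens // stride)  # ceiling division

corpus = 250_000_000                   # ~1GB of text at ~4 chars/token (assumption)
small = vector_count(corpus, 400, 80)  # 400-token chunks, 20% overlap
large = vector_count(corpus, 800, 80)  # 800-token chunks, 10% overlap
print(small, large)
```

At these overlaps, halving the chunk size more than doubles the vector count, which raises storage and retrieval overhead even though the embedding price per token is unchanged.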