The best LLM for RAG in 2026
Context window, long-context recall, citation faithfulness, and price per 1M tokens: the four numbers that actually matter for retrieval-augmented generation, ranked head-to-head.
OpenRouter routes GPT-5, Claude, Gemini, DeepSeek, Llama, Qwen and 100+ other LLMs behind a single key — pay-as-you-go, no monthly minimum, transparent per-token pricing. Try OpenRouter → (affiliate · supports this site)
TL;DR — best pick by RAG workload
| Workload | Best pick | Context | $ in / out (per 1M) |
|---|---|---|---|
| General production RAG | Gemini 2.5 Pro | 2M | $1.25 / $10.00 |
| High-stakes RAG (legal, medical) | Claude Opus 4.1 | 200k | $15.00 / $75.00 |
| Ultra-cheap, high volume | Gemini 2.0 Flash | 1M | $0.10 / $0.40 |
| Open-weights / self-hosted | DeepSeek V3 | 128k | $0.27 / $1.10 |
| Whole-document, no chunking | Gemini 2.5 Pro | 2M | $1.25 / $10.00 |
| Citation-heavy enterprise RAG | Claude Sonnet 4 | 200k | $3.00 / $15.00 |
OpenRouter exposes Gemini 2.5 Pro, Claude Opus 4.1, GPT-5, DeepSeek V3, and 100+ others behind a single key — same per-token price as direct, with automatic fallback if a provider is down. Try OpenRouter → (affiliate · supports this site)
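To make the single-key claim concrete: OpenRouter speaks the OpenAI-compatible chat API, so the official openai Python SDK works with nothing more than a base-URL swap. A minimal sketch; the model slugs and the `models` fallback field follow OpenRouter's conventions as I understand them, and OPENROUTER_API_KEY is a placeholder.

```python
import os
from openai import OpenAI

# One key, many models: point the OpenAI SDK at OpenRouter's endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # placeholder env var
)

response = client.chat.completions.create(
    model="google/gemini-2.5-pro",  # OpenRouter model slug
    # Assumed fallback syntax: OpenRouter reroutes if the primary provider is down.
    extra_body={"models": ["google/gemini-2.5-pro", "anthropic/claude-sonnet-4"]},
    messages=[
        {"role": "system", "content": "Answer strictly from the provided context."},
        {"role": "user", "content": "Context:\n...\n\nQuestion: ..."},
    ],
)
print(response.choices[0].message.content)
```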
The four numbers that decide a RAG model
- Context window. How many tokens of retrieved content can you stuff in before the model truncates? Anything below 32k is too small for serious RAG; 128k is the modern baseline; 1M+ unlocks "drop the whole PDF" workflows.
- Long-context recall. Models advertise 1M-token windows, but their accuracy on facts buried at position 800k can drop 30%+. Independent "needle in a haystack" benchmarks matter more than the headline number (a do-it-yourself probe follows this list).
- Citation faithfulness. When asked for sources, does the model actually quote the retrieved text or fabricate plausible-looking citations? Claude consistently leads here; older GPT-4 generations were notorious for hallucinated page numbers.
- Price per 1M tokens. RAG prompts run 5–50× longer than chat prompts, so input price dominates. A model that's $5/M input is 50× more expensive than Gemini Flash on the same RAG workload.
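Long-context recall is cheap to spot-check yourself. Below is a minimal needle-in-a-haystack probe, assuming an OpenAI-compatible endpoint configured via environment variables; the needle sentence, filler text, and OpenRouter-style model slug are all placeholders.

```python
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ.get("LLM_BASE_URL"),
                api_key=os.environ["LLM_API_KEY"])

NEEDLE = "The access code for project Falcon is 7421."
FILLER = "The quick brown fox jumps over the lazy dog. " * 400  # roughly 4k tokens of noise

def needle_recall(model: str, total_blocks: int, needle_at: int) -> bool:
    """Bury NEEDLE in block `needle_at` of `total_blocks` filler blocks, then ask for it back."""
    blocks = [FILLER] * total_blocks
    blocks[needle_at] = FILLER + NEEDLE + " " + FILLER
    prompt = "\n".join(blocks) + "\n\nWhat is the access code for project Falcon? Answer with the number only."
    resp = client.chat.completions.create(model=model,
                                          messages=[{"role": "user", "content": prompt}])
    return "7421" in resp.choices[0].message.content

# Probe shallow vs. deep placement: headline window sizes often hide deep-recall decay.
for depth in (0, 12, 24):
    print(depth, needle_recall("google/gemini-2.5-pro", total_blocks=25, needle_at=depth))
```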
Long-context model line-up (2026)
| Model | Context | $ in / out | MMLU-Pro | Notes |
|---|---|---|---|---|
| Gemini 2.5 Pro | 2M | $1.25 / $10 | 82.6 | Industry-leading context, strong recall, fair price. |
| GPT-4.1 | 1M | $2.00 / $8 | 75.5 | OpenAI's purpose-built long-context model — strong synthesis. |
| Gemini 1.5 Pro | 2M | $1.25 / $5 | 75.8 | The original 2M model — slightly older, still very usable. |
| Gemini 2.0 Flash | 1M | $0.10 / $0.40 | 76.4 | Best-in-class price for long-context workloads. |
| Claude Opus 4.1 | 200k | $15 / $75 | 87.4 | Best citation faithfulness, premium tier. |
| Claude Sonnet 4 | 200k | $3 / $15 | 82.7 | Production-tier Claude with the same accuracy advantage. |
| GPT-5 | 400k | $1.25 / $10 | 86.8 | Strong all-rounder; better synthesis than Claude on multi-hop. |
| DeepSeek V3 | 128k | $0.27 / $1.10 | 75.9 | Open-weights, frontier-tier value for self-hosted RAG. |
| Llama 3.3 70B | 128k | $0.71 / $0.71 | 68.9 | The open-weights default, great when you control infra. |
Citation faithfulness: why Claude wins for high-stakes RAG
If your RAG pipeline serves lawyers, doctors, or auditors, the cost of a fabricated citation is far higher than the cost of saying "no answer found". On internal "needle in haystack with attribution" tests, the ranking is consistent:
- Claude Opus 4.1 / Sonnet 4 — fewest fabricated quotes; tends to refuse cleanly when retrieval is weak.
- GPT-5 — strong, but more willing to "fill in the gaps" with plausible-sounding text.
- Gemini 2.5 Pro — improved significantly over 1.5; still middle-tier on attribution.
- DeepSeek R1 / V3 — usable but requires stricter prompt scaffolding to avoid invented sources.
The fix isn't always to switch models: well-designed prompts with an explicit "if the source does not contain the answer, reply NO_ANSWER_FOUND" instruction close most of the gap on cheaper models.
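For illustration, here is a grounded-answer scaffold of the kind described above. It's a sketch, not a benchmark-verified prompt; the NO_ANSWER_FOUND sentinel is simply a convention your pipeline checks for downstream, and build_prompt is a hypothetical helper.

```python
GROUNDED_SYSTEM_PROMPT = """\
You answer questions using ONLY the numbered sources below.
Rules:
1. Every claim must cite its source, like [2].
2. Quote exact wording when asked for a citation.
3. If the sources do not contain the answer, reply exactly: NO_ANSWER_FOUND
"""

def build_prompt(chunks: list[str], question: str) -> list[dict]:
    """Number the retrieved chunks and pack them into a chat payload."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return [
        {"role": "system", "content": GROUNDED_SYSTEM_PROMPT},
        {"role": "user", "content": f"Sources:\n{sources}\n\nQuestion: {question}"},
    ]

# Downstream, treat the sentinel as a clean refusal rather than an error:
# if answer.strip() == "NO_ANSWER_FOUND": show the "no answer" UI state.
```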
Cost example: 1,000 RAG queries/day with 50k-token retrieval
A typical enterprise RAG workload: 1,000 queries/day, each sending 50k retrieved tokens and generating a ~500-token answer.
| Model | Per query | Per day | Per year |
|---|---|---|---|
| Gemini 2.0 Flash | $0.0052 | $5.20 | $1,898 |
| DeepSeek V3 | $0.0140 | $14.00 | $5,110 |
| Gemini 2.5 Pro | $0.0675 | $67.50 | $24,638 |
| GPT-5 | $0.0675 | $67.50 | $24,638 |
| Claude Sonnet 4 | $0.1575 | $157.50 | $57,488 |
| Claude Opus 4.1 | $0.7875 | $787.50 | $287,438 |
Assumes 50,000 input + 500 output tokens per query, no caching. Cache discounts (Anthropic 90%, OpenAI 50%) cut these numbers significantly when prompts are reused.
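The arithmetic behind the table is simple enough to keep as a function, which makes re-running it for your own token counts trivial; the prices plugged in below mirror the line-up table above.

```python
def rag_cost(in_price: float, out_price: float,
             in_tokens: int = 50_000, out_tokens: int = 500,
             queries_per_day: int = 1_000) -> tuple[float, float, float]:
    """Per-query, per-day, per-year cost. Prices are $ per 1M tokens."""
    per_query = (in_tokens * in_price + out_tokens * out_price) / 1e6
    per_day = per_query * queries_per_day
    return per_query, per_day, per_day * 365

for name, inp, outp in [("Gemini 2.0 Flash", 0.10, 0.40),
                        ("Claude Sonnet 4", 3.00, 15.00)]:
    q, d, y = rag_cost(inp, outp)
    print(f"{name}: ${q:.4f}/query  ${d:.2f}/day  ${y:,.0f}/year")
# Gemini 2.0 Flash: $0.0052/query  $5.20/day  $1,898/year
```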
Prompt-cache discounts can flip the ranking
If your RAG prompts share a stable prefix (style guide, role prompt, a pinned document set), prompt caching tilts the math hard:
- Anthropic bills cache reads at 90% off the input price (writes cost 25% extra; 5-minute TTL by default), which is huge for repeated retrieval contexts.
- OpenAI discounts cached input by 50%, applied automatically to repeated prompt prefixes of 1,024+ tokens.
- Google offers explicit context caching, which only pays off above roughly 32k tokens of stable prefix.
- DeepSeek caches automatically and bills cache hits at a steep discount to its already-low base price.
For a heavily-cached production RAG with 80% prefix reuse, Claude Sonnet 4's blended input price works out to roughly $0.84/M (0.8 × $0.30 cached + 0.2 × $3.00 fresh), well under GPT-5's $1.25/M list price, so it's worth re-running the cost table with your actual reuse rate.
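The blended-price arithmetic, so the figure above is reproducible; the 90% read discount matches Anthropic's published cache pricing, and cache-write surcharges are deliberately ignored to keep the sketch simple.

```python
def blended_input_price(base: float, reuse: float, cache_discount: float) -> float:
    """$ per 1M input tokens when a fraction `reuse` of the prompt is served from cache."""
    cached = base * (1 - cache_discount)  # e.g. $3.00/M -> $0.30/M at 90% off
    return reuse * cached + (1 - reuse) * base

print(blended_input_price(base=3.00, reuse=0.80, cache_discount=0.90))  # 0.84
```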
The verdict
For most production RAG, Gemini 2.5 Pro is the default: 2M context, strong recall, and $1.25/$10 pricing put it in the same ballpark as GPT-5 with 5× the window. For high-stakes citation work, Claude Opus 4.1 remains the gold standard. For budget-sensitive volume RAG, Gemini 2.0 Flash is genuinely unbeatable; and if you can self-host, DeepSeek V3 matches its quality at near-zero marginal per-token cost.
The most expensive RAG mistake is using a frontier model when retrieval is the bottleneck. Fix your retriever first; only then upgrade the LLM.
Frequently asked questions
What is the best LLM for RAG in 2026?
Gemini 2.5 Pro is the current production default: 2M context, strong recall, and $1.25/$10 per 1M tokens make it roughly 10× cheaper per long-context query than Claude Opus 4.1. For citation-heavy work, Claude still wins on faithfulness.
Do I need a 1M-token context model for RAG?
No. Most production RAG works fine with 32k–128k context if your retriever is decent. Long-context only pays off when you want to skip chunking or do whole-document synthesis. Cost grows linearly with prompt length, so longer is not free.
Should I use RAG or fine-tuning?
RAG for fresh / changing data; fine-tuning for stable style or domain language. Most teams want both: RAG for facts, fine-tuning for tone. They're complementary, not competing.
What's the best embedding model to pair with my RAG LLM?
For English: OpenAI's text-embedding-3-large or Voyage AI's voyage-3 are the current top picks. For multilingual: BGE-M3 (open-source) or Cohere embed-multilingual-v3. Retriever quality is behind more RAG failures than LLM choice is.
Can I run RAG with an open-source LLM?
Yes. DeepSeek V3 (128k context) and Llama 3.3 70B (128k context) both deliver production-grade RAG, and Qwen 2.5 72B is a popular pick for Chinese / multilingual RAG. Llama 3.3 70B self-hosts comfortably on 4×H100 (DeepSeek V3's much larger MoE weights need a bigger cluster), pushing your marginal $/M token cost toward zero at scale; see the sketch below.
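A minimal self-hosted serving sketch with vLLM, assuming a 4×H100 node and the Hugging Face weights for Llama 3.3 70B; the prompt and sampling settings are illustrative only.

```python
from vllm import LLM, SamplingParams

# Shard the 70B model across 4 GPUs; vLLM handles batching and paged attention.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    max_model_len=131072,  # the full 128k window, if KV-cache memory allows
)

params = SamplingParams(temperature=0.0, max_tokens=500)
outputs = llm.generate(
    ["Sources:\n[1] ...\n\nQuestion: ...\nAnswer with citations or NO_ANSWER_FOUND."],
    params,
)
print(outputs[0].outputs[0].text)
```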
Related: Best cheap LLM API · Best open-source LLM · Best LLM for coding · Full leaderboard
Spotted out-of-date numbers? Open an issue — corrections usually ship within 24h.