Leaderboard · Guide · Updated
The best LLM for translation in 2026
Ranked by language pair, quality vs DeepL / Google, per-1M-character cost, and document-level long-context translation. Frontier models (Claude Opus, GPT-5, Gemini 2.5 Pro) versus value picks (DeepSeek V3, Qwen 2.5, Gemini Flash).
OpenRouter routes Claude Opus, GPT-5, Gemini 2.5 Pro / Flash, DeepSeek V3, Qwen 2.5, Mistral Large and 100+ other LLMs behind a single key — pay-as-you-go, transparent per-token pricing, automatic failover. Try OpenRouter → (affiliate · supports this site)
TL;DR — best pick by use case
| Use case | Recommended | $ in / out (per 1M) | Why |
|---|---|---|---|
| Best overall translation quality | Claude Opus 4.1 | $15 / $75 | Idiomatic, tone-preserving, best on literary & legal prose. |
| Frontier quality + multimodal | GPT-5 | $1.25 / $10 | Strong on every major pair, native voice + image input. |
| Long documents (200k–2M tokens) | Gemini 2.5 Pro | $1.25 / $10 | 2M context — translate a full novel in one call, glossary stays consistent. |
| High-volume cheap translation | Gemini 2.5 Flash | $0.30 / $2.50 | Frontier-family quality at flash-tier price — best $/quality for batch. |
| CJK (zh / ja / ko) source or target | Qwen 2.5 72B | $0.35 / $0.40 | Native CJK tokeniser — ~2× cheaper per character than Western models. |
| Cheapest open-weights option | DeepSeek V3 | $0.27 / $1.10 | Strong on EN↔CJK and EN↔major-EU, self-hostable. |
| European languages, EU data residency | Mistral Large 2 | $2 / $6 | French / German / Italian / Spanish first-class; EU-hosted. |
How translation quality differs from chat quality
Translation is the LLM task where benchmark scores correlate least with real-world output. A model with a high MMLU-Pro score can still produce stiff, literal translations, while a 32B open-weights model fine-tuned on parallel corpora can outperform it on the same pair. The dimensions that actually matter:
- **Tokenisation efficiency** (cost): Western tokenisers split CJK and Cyrillic into 1.5–2× more tokens, making a "$1.25 / 1M" model effectively $2+ per 1M Chinese characters. Qwen, DeepSeek, and Yi tokenisers stay close to 1:1 for their native languages.
- **Tone & register preservation** (quality): Frontier models (Claude Opus 4.1, GPT-5) follow instructions like "translate into formal British English with subjunctive mood preserved" reliably. Smaller models drop the register half the time.
- **Glossary consistency over long inputs** (quality): Chunked translation breaks consistency. A 2M-context model (Gemini 2.5 Pro) translating a whole book in one shot is structurally better than a 32k model doing it in 60 chunks.
- **Idiom & cultural adaptation** (quality): "Break a leg" → "祝你好运" not "打断一条腿". Claude Opus 4.1 leads here; DeepL is surprisingly weak; smaller LLMs are coin flips.
- **Output token bloat** (cost): Chinese → English typically doubles token count; English → German adds ~20%. Output price often dominates total cost — check the row below.
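The tokenisation and output-bloat effects compound, and the arithmetic is simple enough to sketch in a few lines. The token-per-character ratios and expansion factor below are illustrative assumptions, not measured tokeniser stats:

```python
def effective_cost_per_million_chars(
    price_in: float,         # $ per 1M input tokens
    price_out: float,        # $ per 1M output tokens
    tok_per_char_in: float,  # tokens per source character (tokeniser-dependent)
    tok_per_char_out: float, # tokens per output character
    expansion: float,        # output characters per source character
) -> float:
    """Dollar cost to translate 1M source characters."""
    tokens_in = 1_000_000 * tok_per_char_in
    tokens_out = 1_000_000 * expansion * tok_per_char_out
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# Illustrative: a $1.25/$10 model doing ZH→EN, assuming a Western
# tokeniser spends ~1.7 tokens per Chinese character, ~0.25 tokens
# per English character, and ~1.5× character expansion.
cost = effective_cost_per_million_chars(1.25, 10.0, 1.7, 0.25, 1.5)
print(f"${cost:.2f} per 1M source characters")
```

Swap in your model's real tokeniser ratios; for CJK-native tokenisers the input ratio drops toward 1.0, which is exactly the per-character advantage the table above prices in.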
By language pair
English ↔ Chinese / Japanese / Korean (CJK)
Best value: Qwen 2.5 72B ($0.35/$0.40) and DeepSeek V3 ($0.27/$1.10). Both tokenise CJK natively, halving effective per-character cost vs GPT-5. Quality on EN↔ZH is at or above frontier-Western models for prose; slightly behind on rare technical terms.
Best quality: Claude Opus 4.1 for legal, medical, and literary translation where idiom and tone matter more than cost. GPT-5 is a close second and the right pick if you also need image OCR (translating Japanese signs, Chinese contracts in PDF).
For Chinese specifically, see the dedicated Best LLM for Chinese guide — it covers tokenisation and pricing in more depth.
English ↔ major European languages (FR, DE, ES, IT, PT)
Best quality: Claude Opus 4.1 and GPT-5 — both fluent and idiomatic in all five. Gendered grammar (German der/die/das, Romance noun agreement) is handled correctly far more often than mid-tier models.
Best EU-hosted option: Mistral Large 2. Trained with strong French / German / Italian / Spanish weighting; available with EU data residency for GDPR-sensitive workloads.
Best value: Gemini 2.5 Flash at $0.30/$2.50. For en↔fr or en↔de batch translation of marketing copy, support tickets, or product catalogues, the quality is indistinguishable from frontier 95%+ of the time at <5% the cost.
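For batch workloads like these, a common trick is to pack many short strings into one numbered prompt and split the reply back out, amortising per-request overhead on a cheap model. A minimal sketch; the `<i>` delimiter scheme is our own convention, not an API feature:

```python
def pack_batch(strings: list[str]) -> str:
    """Pack short strings into one numbered prompt so a cheap model
    can translate them all in a single call."""
    return "\n".join(f"<{i}> {s}" for i, s in enumerate(strings))

def unpack_batch(reply: str, n: int) -> list[str]:
    """Recover translations by their <i> markers (order-independent,
    so a model that reorders lines still parses correctly)."""
    out = [""] * n
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("<"):
            idx_str, _, rest = line[1:].partition("> ")
            if idx_str.isdigit() and int(idx_str) < n:
                out[int(idx_str)] = rest
    return out

# e.g. send pack_batch(ui_strings) with an instruction like
# "Translate each numbered line to French, keeping the <i> markers."
```

Spot-check the unpacked output length against the input: models occasionally merge or drop lines, and an empty slot is your signal to retry that string individually.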
English ↔ low-resource languages (Vietnamese, Thai, Indonesian, Hindi, Swahili, etc.)
Best quality: Gemini 2.5 Pro. Google's training corpus has the broadest low-resource coverage, and Gemini consistently outperforms GPT-5 and Claude on Hindi, Bengali, Vietnamese, Thai, and most African languages on side-by-side evaluations.
Caveat: for any low-resource pair, run a 100-string side-by-side eval before committing. Quality varies wildly — a model that's great at en↔hi may be poor at en↔ta.
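A minimal harness for that kind of side-by-side eval might look like the sketch below. `translate_a`, `translate_b`, and `judge` are hypothetical callables you wire to your real API clients and to a human (or LLM) judge; the key detail is randomising presentation order so the judge is blind to which model produced which candidate:

```python
import random

def side_by_side_eval(sources, translate_a, translate_b, judge):
    """Blind A/B eval: the judge sees the two candidates in random order.

    judge(source, cand_1, cand_2) returns 1 or 2 for the better candidate.
    """
    wins = {"a": 0, "b": 0}
    for src in sources:
        a, b = translate_a(src), translate_b(src)
        flipped = random.random() < 0.5          # hide which model is which
        first, second = (b, a) if flipped else (a, b)
        pick = judge(src, first, second)
        winner = ("b" if pick == 1 else "a") if flipped else ("a" if pick == 1 else "b")
        wins[winner] += 1
    return wins

# Run on ~100 strings sampled from your real corpus, not benchmark data.
```

Sampling from your own corpus matters more than sample size: a model's benchmark-domain fluency tells you little about how it handles your product names and house style.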
Multilingual within one document (code-switched, mixed)
Best: Claude Opus 4.1 and GPT-5. Both detect and preserve code-switching in input (e.g., a Hindi-English transcript) without flattening it into one language. Smaller models tend to over-translate.
Long document translation
The biggest practical difference between frontier models in 2026 is context window. A 2M-token model can translate a full novel (≈200k words = ~270k tokens) plus a 50k-token glossary plus 4 reference translations — all in one prompt, with full glossary consistency end-to-end. Chunked translation always breaks on cross-chapter terminology.
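In practice the one-shot approach is just "whole document plus glossary in one request". The payload below follows the OpenAI-style chat-completions shape that OpenRouter and most providers accept; the model slug and the `src -> tgt` glossary format are assumptions for illustration:

```python
import json

def build_translation_request(document: str, glossary: dict[str, str],
                              source_lang: str, target_lang: str,
                              model: str = "google/gemini-2.5-pro") -> dict:
    """One-shot payload: whole document + glossary in a single prompt,
    so terminology stays consistent end-to-end (no chunking)."""
    glossary_block = "\n".join(f"{src} -> {tgt}" for src, tgt in glossary.items())
    system = (
        f"Translate the user's document from {source_lang} to {target_lang}. "
        "Use these glossary mappings verbatim, every time the term appears:\n"
        + glossary_block
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": document},
        ],
    }

req = build_translation_request("...", {"Vertrag": "agreement"}, "German", "English")
print(json.dumps(req, indent=2)[:200])
```

POST the resulting JSON to your provider's chat-completions endpoint (for OpenRouter, `https://openrouter.ai/api/v1/chat/completions` with your key in the `Authorization` header).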
| Model | Context | $ in / out | Best for |
|---|---|---|---|
| Gemini 2.5 Pro | 2,000,000 | $1.25 / $10 | Full novels, full codebases, multi-document research papers. |
| GPT-4.1 | 1,000,000 | $2 / $8 | Long technical docs; OpenAI ecosystem integration. |
| GPT-5 | 400,000 | $1.25 / $10 | Mid-length docs (book chapters, long contracts). |
| Claude Opus 4.1 | 200,000 | $15 / $75 | ~150k-token docs where prose polish > cost. |
| DeepSeek R1 | 128,000 | $0.55 / $2.19 | Long-doc translation that needs step-by-step reasoning. |
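Before committing to one-shot translation, it's worth a rough budget check: input, glossary, and expected output all have to fit the window. A conservative sketch (note that some APIs cap output tokens separately from the context window, so check your provider's limits too):

```python
def fits_in_context(doc_tokens: int, glossary_tokens: int,
                    expected_output_tokens: int, context_window: int,
                    overhead: int = 2_000) -> bool:
    """Rough single-budget estimate: prompt + glossary + output + a
    small overhead allowance must fit within one request."""
    total = doc_tokens + glossary_tokens + expected_output_tokens + overhead
    return total <= context_window

# A ~270k-token novel + 50k-token glossary + ~300k-token output needs
# ~620k tokens total: fits a 2M window, not a 200k one.
```

When the check fails, fall back to chunking with the glossary repeated in every chunk's system prompt, which limits (but does not eliminate) terminology drift.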
Cost calculator: 1M characters / day translation workload
Assume a typical pipeline that ingests 1M source characters and emits ~1.2M output characters per day (output expansion varies by language pair: ZH→EN is ~1.5×, EN→DE ~1.2×, EN→ES ~1.1×). Effective per-character costs assume average tokenisation ratios:
| Model | Daily cost | Monthly cost | Yearly cost |
|---|---|---|---|
| Qwen 2.5 72B (CJK pairs) | $0.45 | $14 | $164 |
| DeepSeek V3 | $1.10 | $33 | $402 |
| GPT-4o mini | $1.05 | $32 | $385 |
| Gemini 2.5 Flash | $3.45 | $104 | $1,259 |
| Mistral Large 2 | $10.80 | $324 | $3,942 |
| GPT-5 | $13.75 | $413 | $5,019 |
| Claude Opus 4.1 | $108 | $3,240 | $39,420 |
To plug in your exact numbers, use the interactive API cost calculator; it accepts custom token counts so you can model your real tokenisation overhead.
OpenRouter exposes Claude Opus, GPT-5, Gemini 2.5 Pro / Flash, DeepSeek V3, Qwen 2.5, Mistral Large and 100+ others behind one key — same per-token price as direct, with automatic provider failover. The fastest way to A/B test 4 models on the same source corpus. Get an OpenRouter key → (affiliate)
LLMs vs DeepL / Google Translate
The honest answer in 2026: it depends on the task.
- **DeepL still wins** (short-form): Sentence-by-sentence translation of UI strings, product descriptions, and short customer messages between major European pairs. DeepL is faster, cheaper, and produces marginally more "natural-sounding" output for these.
- **LLMs win** (long-form / contextual): Anything where context matters: long documents, legal contracts, literary prose, technical docs with code, idiom-heavy marketing copy, or when you need instructions like "preserve the formal register" or "use British English". Frontier LLMs (Claude Opus 4.1, GPT-5) match or exceed DeepL on professional-grade prose.
- **LLMs uniquely win** (low-resource & CJK): For CJK quality and for low-resource languages (Vietnamese, Thai, Hindi, Swahili), Gemini 2.5 Pro and Claude Opus 4.1 outperform DeepL — DeepL coverage is heavily biased toward European languages.
The real change since 2024 is that GPT-4o mini and Gemini 2.5 Flash are now cheaper per character than DeepL Pro at comparable quality, breaking DeepL's price advantage on bulk workloads.
The verdict
Pick by language family and budget:
- **Top quality, any pair:** Claude Opus 4.1. Escalate here when one document needs to be perfect.
- **Best value, Latin-alphabet:** Gemini 2.5 Flash. 95% of frontier quality at <5% the cost.
- **Best value, CJK:** Qwen 2.5 72B or DeepSeek V3. Native tokenisation halves per-character cost.
- **Long documents:** Gemini 2.5 Pro. 2M context preserves glossary consistency end-to-end.
- **EU data residency:** Mistral Large 2. Strong on FR/DE/IT/ES, EU-hosted.
Run a 100-string A/B on real source content before committing to one model — translation quality varies more by language pair than benchmark scores predict. OpenRouter exposes all of the above on a single key, which is the fastest way to A/B test before you wire up production.
Frequently asked questions
Is Claude better than GPT-5 for translation?
For literary, legal, and idiom-heavy prose, Claude Opus 4.1 tends to win on side-by-side blind tests. For technical documentation, code-mixed content, and multimodal inputs (translating an image of a sign), GPT-5 is stronger. Both are far ahead of mid-tier models — pick on price or on the multimodal needs of your pipeline.
Are LLMs better than DeepL or Google Translate?
For short, single-sentence translations between major European pairs, DeepL is still excellent and cheap. LLMs win on long documents, idioms, tone preservation, low-resource languages, and any translation where context or instructions matter. See the section above for a full comparison.
Which LLM is cheapest for translation?
For Latin-alphabet languages, Gemini 2.5 Flash ($0.30/$2.50) and GPT-4o mini ($0.15/$0.60). For CJK, Qwen 2.5 72B ($0.35/$0.40) and DeepSeek V3 ($0.27/$1.10) win on per-character cost because of native tokenisation.
What is the best LLM for translating long documents?
Gemini 2.5 Pro at 2M-token context is the only model that can hold a full novel or long technical manual in one request, and it preserves cross-chapter terminology consistency that chunked translation breaks. GPT-4.1 at 1M tokens is the strongest second pick.
Can I self-host an LLM for translation?
Yes — Qwen 2.5 72B, DeepSeek V3, and Llama 3.3 70B all have open weights. Qwen is the strongest pick for CJK; DeepSeek for general bilingual; Llama 3.3 for Latin-alphabet languages with English-to-X bias. A single H100 runs Qwen 2.5 72B at production speed; DeepSeek V3 needs more (it's a 671B-parameter MoE).
Related: Best LLM for Chinese · Best cheap LLM API · Best open-source LLM · Best LLM for RAG
Methodology and sources: see About. Spotted a mistake? Open an issue.
Get the weekly LLM digest
Big releases, leaderboard movements, price drops, and the chart that mattered this week — including translation-model updates. No spam.
Or follow updates on GitHub.