Leaderboard · Guide · Updated
The best LLM for translation in 2026
Ranked by language pair, quality vs DeepL / Google, per-1M-character cost, and document-level long-context translation. Frontier models (Claude Opus, GPT-5, Gemini 2.5 Pro) versus value picks (DeepSeek V3, Qwen 2.5, Gemini Flash).
OpenRouter routes Claude Opus, GPT-5, Gemini 2.5 Pro / Flash, DeepSeek V3, Qwen 2.5, Mistral Large and 100+ other LLMs behind a single key — pay-as-you-go, transparent per-token pricing, automatic failover. Try OpenRouter → (affiliate · supports this site)
TL;DR — best pick by use case
| Use case | Recommended | $ in / out (per 1M) | Why |
|---|---|---|---|
| Best overall translation quality | Claude Opus 4.1 | $15 / $75 | Idiomatic, tone-preserving, best on literary & legal prose. |
| Frontier quality + multimodal | GPT-5 | $1.25 / $10 | Strong on every major pair, native voice + image input. |
| Long documents (200k–2M tokens) | Gemini 2.5 Pro | $1.25 / $10 | 2M context — translate a full novel in one call, glossary stays consistent. |
| High-volume cheap translation | Gemini 2.5 Flash | $0.30 / $2.50 | Frontier-family quality at flash-tier price — best $/quality for batch. |
| CJK (zh / ja / ko) source or target | Qwen 2.5 72B | $0.35 / $0.40 | Native CJK tokeniser — ~2× cheaper per character than Western models. |
| Cheapest open-weights option | DeepSeek V3 | $0.27 / $1.10 | Strong on EN↔CJK and EN↔major-EU, self-hostable. |
| European languages, EU data residency | Mistral Large 2 | $2 / $6 | French / German / Italian / Spanish first-class; EU-hosted. |
How translation quality differs from chat quality
Translation is the LLM task where benchmark scores correlate least with real-world output. A model with a high MMLU-Pro score can still produce stiff, literal translations, while a 32B open-weights model fine-tuned on parallel corpora can outperform it on the same pair. The dimensions that actually matter:
- **Tokenisation efficiency** (cost): Western tokenisers split CJK and Cyrillic into 1.5–2× more tokens, making a "$1.25 / 1M" model effectively $2+ per 1M Chinese characters. Qwen, DeepSeek, and Yi tokenisers stay close to 1:1 for their native languages.
- **Tone & register preservation** (quality): Frontier models (Claude Opus 4.1, GPT-5) follow instructions like "translate into formal British English with subjunctive mood preserved" reliably. Smaller models drop the register half the time.
- **Glossary consistency over long inputs** (quality): Chunked translation breaks consistency. A 2M-context model (Gemini 2.5 Pro) translating a whole book in one shot is structurally better than a 32k model doing it in 60 chunks.
- **Idiom & cultural adaptation** (quality): "Break a leg" → "祝你好运" not "打断一条腿". Claude Opus 4.1 leads here; DeepL is surprisingly weak; smaller LLMs are coin flips.
- **Output token bloat** (cost): Chinese → English typically doubles token count; English → German adds ~20%. Output price often dominates total cost — check the row below.
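The tokenisation and output-bloat effects compound, and the arithmetic is simple enough to sketch in a few lines. The token-per-character ratios and expansion factor below are illustrative assumptions, not measured tokeniser stats:

```python
def effective_cost_per_million_chars(
    price_in: float,         # $ per 1M input tokens
    price_out: float,        # $ per 1M output tokens
    tok_per_char_in: float,  # tokens per source character (tokeniser-dependent)
    tok_per_char_out: float, # tokens per output character
    expansion: float,        # output characters per source character
) -> float:
    """Dollar cost to translate 1M source characters."""
    tokens_in = 1_000_000 * tok_per_char_in
    tokens_out = 1_000_000 * expansion * tok_per_char_out
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# Illustrative: a $1.25/$10 model doing ZH→EN, assuming a Western
# tokeniser spends ~1.7 tokens per Chinese character, ~0.25 tokens
# per English character, and ~1.5× character expansion.
cost = effective_cost_per_million_chars(1.25, 10.0, 1.7, 0.25, 1.5)
print(f"${cost:.2f} per 1M source characters")
```

Swap in your model's real tokeniser ratios; for CJK-native tokenisers the input ratio drops toward 1.0, which is exactly the per-character advantage the table above prices in.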
By language pair
English ↔ Chinese / Japanese / Korean (CJK)
Best value: Qwen 2.5 72B ($0.35/$0.40) and DeepSeek V3 ($0.27/$1.10). Both tokenise CJK natively, halving effective per-character cost vs GPT-5. Quality on EN↔ZH is at or above frontier-Western models for prose; slightly behind on rare technical terms.
Best quality: Claude Opus 4.1 for legal, medical, and literary translation where idiom and tone matter more than cost. GPT-5 is a close second and the right pick if you also need image OCR (translating Japanese signs, Chinese contracts in PDF).
For Chinese specifically, see the dedicated Best LLM for Chinese guide — it covers tokenisation and pricing in more depth.
English ↔ major European languages (FR, DE, ES, IT, PT)
Best quality: Claude Opus 4.1 and GPT-5 — both fluent and idiomatic in all five. Gendered grammar (German der/die/das, Romance noun agreement) is handled correctly far more often than mid-tier models.
Best EU-hosted option: Mistral Large 2. Trained with strong French / German / Italian / Spanish weighting; available with EU data residency for GDPR-sensitive workloads.
Best value: Gemini 2.5 Flash at $0.30/$2.50. For en↔fr or en↔de batch translation of marketing copy, support tickets, or product catalogues, the quality is indistinguishable from frontier 95%+ of the time at <5% the cost.
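For batch workloads like these, a common trick is to pack many short strings into one numbered prompt and split the reply back out, amortising per-request overhead on a cheap model. A minimal sketch; the `<i>` delimiter scheme is our own convention, not an API feature:

```python
def pack_batch(strings: list[str]) -> str:
    """Pack short strings into one numbered prompt so a cheap model
    can translate them all in a single call."""
    return "\n".join(f"<{i}> {s}" for i, s in enumerate(strings))

def unpack_batch(reply: str, n: int) -> list[str]:
    """Recover translations by their <i> markers (order-independent,
    so a model that reorders lines still parses correctly)."""
    out = [""] * n
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("<"):
            idx_str, _, rest = line[1:].partition("> ")
            if idx_str.isdigit() and int(idx_str) < n:
                out[int(idx_str)] = rest
    return out

# e.g. send pack_batch(ui_strings) with an instruction like
# "Translate each numbered line to French, keeping the <i> markers."
```

Spot-check the unpacked output length against the input: models occasionally merge or drop lines, and an empty slot is your signal to retry that string individually.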
English ↔ low-resource languages (Vietnamese, Thai, Indonesian, Hindi, Swahili, etc.)
Best quality: Gemini 2.5 Pro. Google's training corpus has the broadest low-resource coverage, and Gemini consistently outperforms GPT-5 and Claude on Hindi, Bengali, Vietnamese, Thai, and most African languages on side-by-side evaluations.
Caveat: for any low-resource pair, run a 100-string side-by-side eval before committing. Quality varies wildly — a model that's great at en↔hi may be poor at en↔ta.
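A minimal harness for that kind of side-by-side eval might look like the sketch below. `translate_a`, `translate_b`, and `judge` are hypothetical callables you wire to your real API clients and to a human (or LLM) judge; the key detail is randomising presentation order so the judge is blind to which model produced which candidate:

```python
import random

def side_by_side_eval(sources, translate_a, translate_b, judge):
    """Blind A/B eval: the judge sees the two candidates in random order.

    judge(source, cand_1, cand_2) returns 1 or 2 for the better candidate.
    """
    wins = {"a": 0, "b": 0}
    for src in sources:
        a, b = translate_a(src), translate_b(src)
        flipped = random.random() < 0.5          # hide which model is which
        first, second = (b, a) if flipped else (a, b)
        pick = judge(src, first, second)
        winner = ("b" if pick == 1 else "a") if flipped else ("a" if pick == 1 else "b")
        wins[winner] += 1
    return wins

# Run on ~100 strings sampled from your real corpus, not benchmark data.
```

Sampling from your own corpus matters more than sample size: a model's benchmark-domain fluency tells you little about how it handles your product names and house style.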
Multilingual within one document (code-switched, mixed)
Best: Claude Opus 4.1 and GPT-5. Both detect and preserve code-switching in input (e.g., a Hindi-English transcript) without flattening it into one language. Smaller models tend to over-translate.
Long document translation
The biggest practical difference between frontier models in 2026 is context window. A 2M-token model can translate a full novel (≈200k words = ~270k tokens) plus a 50k-token glossary plus 4 reference translations — all in one prompt, with full glossary consistency end-to-end. Chunked translation always breaks on cross-chapter terminology.
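In practice the one-shot approach is just "whole document plus glossary in one request". The payload below follows the OpenAI-style chat-completions shape that OpenRouter and most providers accept; the model slug and the `src -> tgt` glossary format are assumptions for illustration:

```python
import json

def build_translation_request(document: str, glossary: dict[str, str],
                              source_lang: str, target_lang: str,
                              model: str = "google/gemini-2.5-pro") -> dict:
    """One-shot payload: whole document + glossary in a single prompt,
    so terminology stays consistent end-to-end (no chunking)."""
    glossary_block = "\n".join(f"{src} -> {tgt}" for src, tgt in glossary.items())
    system = (
        f"Translate the user's document from {source_lang} to {target_lang}. "
        "Use these glossary mappings verbatim, every time the term appears:\n"
        + glossary_block
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": document},
        ],
    }

req = build_translation_request("...", {"Vertrag": "agreement"}, "German", "English")
print(json.dumps(req, indent=2)[:200])
```

POST the resulting JSON to your provider's chat-completions endpoint (for OpenRouter, `https://openrouter.ai/api/v1/chat/completions` with your key in the `Authorization` header).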
| Model | Context | $ in / out | Best for |
|---|---|---|---|
| Gemini 2.5 Pro | 2,000,000 | $1.25 / $10 | Full novels, full codebases, multi-document research papers. |
| GPT-4.1 | 1,000,000 | $2 / $8 | Long technical docs; OpenAI ecosystem integration. |
| GPT-5 | 400,000 | $1.25 / $10 | Mid-length docs (book chapters, long contracts). |
| Claude Opus 4.1 | 200,000 | $15 / $75 | ~150k-token docs where prose polish > cost. |
| DeepSeek R1 | 128,000 | $0.55 / $2.19 | Long-doc translation that needs step-by-step reasoning. |
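Before committing to one-shot translation, it's worth a rough budget check: input, glossary, and expected output all have to fit the window. A conservative sketch (note that some APIs cap output tokens separately from the context window, so check your provider's limits too):

```python
def fits_in_context(doc_tokens: int, glossary_tokens: int,
                    expected_output_tokens: int, context_window: int,
                    overhead: int = 2_000) -> bool:
    """Rough single-budget estimate: prompt + glossary + output + a
    small overhead allowance must fit within one request."""
    total = doc_tokens + glossary_tokens + expected_output_tokens + overhead
    return total <= context_window

# A ~270k-token novel + 50k-token glossary + ~300k-token output needs
# ~620k tokens total: fits a 2M window, not a 200k one.
```

When the check fails, fall back to chunking with the glossary repeated in every chunk's system prompt, which limits (but does not eliminate) terminology drift.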
Cost calculator: 1M characters / day translation workload
Assume a typical pipeline that ingests 1M source characters and emits ~1.2M output characters per day (output expansion varies by language pair: ZH→EN is ~1.5×, EN→DE ~1.2×, EN→ES ~1.1×). Effective per-character costs assume average tokenisation ratios:
| Model | Daily cost | Monthly cost | Yearly cost |
|---|---|---|---|
| Qwen 2.5 72B (CJK pairs) | $0.45 | $14 | $164 |
| DeepSeek V3 | $1.10 | $33 | $402 |
| GPT-4o mini | $1.05 | $32 | $385 |
| Gemini 2.5 Flash | $3.45 | $104 | $1,259 |
| Mistral Large 2 | $10.80 | $324 | $3,942 |
| GPT-5 | $13.75 | $413 | $5,019 |
| Claude Opus 4.1 | $108 | $3,240 | $39,420 |
To plug in your exact numbers, use the interactive API cost calculator; it accepts custom token counts so you can model your real tokenisation overhead.
OpenRouter exposes Claude Opus, GPT-5, Gemini 2.5 Pro / Flash, DeepSeek V3, Qwen 2.5, Mistral Large and 100+ others behind one key — same per-token price as direct, with automatic provider failover. The fastest way to A/B test 4 models on the same source corpus. Get an OpenRouter key → (affiliate)
LLMs vs DeepL / Google Translate
The honest answer in 2026: it depends on the task.
- **DeepL still wins** (short-form): Sentence-by-sentence translation of UI strings, product descriptions, and short customer messages between major European pairs. DeepL is faster, cheaper, and produces marginally more "natural-sounding" output for these.
- **LLMs win** (long-form / contextual): Anything where context matters: long documents, legal contracts, literary prose, technical docs with code, idiom-heavy marketing copy, or when you need instructions like "preserve the formal register" or "use British English". Frontier LLMs (Claude Opus 4.1, GPT-5) match or exceed DeepL on professional-grade prose.
- **LLMs uniquely win** (low-resource & CJK): For CJK quality and for low-resource languages (Vietnamese, Thai, Hindi, Swahili), Gemini 2.5 Pro and Claude Opus 4.1 outperform DeepL — DeepL coverage is heavily biased toward European languages.
The real change since 2024 is that GPT-4o mini and Gemini 2.5 Flash are now cheaper per character than DeepL Pro at comparable quality, breaking DeepL's price advantage on bulk workloads.
The verdict
Pick by language family and budget:
- **Top quality, any pair:** Claude Opus 4.1. Escalate here when one document needs to be perfect.
- **Best value, Latin-alphabet:** Gemini 2.5 Flash. 95% of frontier quality at <5% the cost.
- **Best value, CJK:** Qwen 2.5 72B or DeepSeek V3. Native tokenisation halves per-character cost.
- **Long documents:** Gemini 2.5 Pro. 2M context preserves glossary consistency end-to-end.
- **EU data residency:** Mistral Large 2. Strong on FR/DE/IT/ES, EU-hosted.
Run a 100-string A/B on real source content before committing to one model — translation quality varies more by language pair than benchmark scores predict. OpenRouter exposes all of the above on a single key, which is the fastest way to A/B test before you wire up production.
Frequently asked questions
Is Claude better than GPT-5 for translation?
For literary, legal, and idiom-heavy prose, Claude Opus 4.1 tends to win on side-by-side blind tests. For technical documentation, code-mixed content, and multimodal inputs (translating an image of a sign), GPT-5 is stronger. Both are far ahead of mid-tier models — pick on price or on the multimodal needs of your pipeline.
Are LLMs better than DeepL or Google Translate?
For short, single-sentence translations between major European pairs, DeepL is still excellent and cheap. LLMs win on long documents, idioms, tone preservation, low-resource languages, and any translation where context or instructions matter. See the section above for a full comparison.
Which LLM is cheapest for translation?
For Latin-alphabet languages, Gemini 2.5 Flash ($0.30/$2.50) and GPT-4o mini ($0.15/$0.60). For CJK, Qwen 2.5 72B ($0.35/$0.40) and DeepSeek V3 ($0.27/$1.10) win on per-character cost because of native tokenisation.
What is the best LLM for translating long documents?
Gemini 2.5 Pro at 2M-token context is the only model that can hold a full novel or long technical manual in one request, and it preserves cross-chapter terminology consistency that chunked translation breaks. GPT-4.1 at 1M tokens is the strongest second pick.
Can I self-host an LLM for translation?
Yes — Qwen 2.5 72B, DeepSeek V3, and Llama 3.3 70B all have open weights. Qwen is the strongest pick for CJK; DeepSeek for general bilingual; Llama 3.3 for Latin-alphabet languages with English-to-X bias. A single H100 runs Qwen 2.5 72B at production speed; DeepSeek V3 needs more (it's a 671B-parameter MoE).
Related: Best LLM for Chinese · Best cheap LLM API · Best open-source LLM · Best LLM for RAG
Methodology and sources: see About. Spotted a mistake? Open an issue.
Get the weekly LLM digest
Big releases, leaderboard movements, price drops, and the chart that mattered this week — including translation-model updates. No spam.
Or follow updates on GitHub.