LLM Rank.top


The best LLM for translation in 2026

Ranked by language pair, quality vs DeepL / Google, per-1M-character cost, and document-level long-context translation. Frontier models (Claude Opus, GPT-5, Gemini 2.5 Pro) versus value picks (DeepSeek V3, Qwen 2.5, Gemini Flash).

Try every translation model in this guide from one API key.

OpenRouter routes Claude Opus, GPT-5, Gemini 2.5 Pro / Flash, DeepSeek V3, Qwen 2.5, Mistral Large and 100+ other LLMs behind a single key — pay-as-you-go, transparent per-token pricing, automatic failover. Try OpenRouter → (affiliate · supports this site)

TL;DR — best pick by use case

| Use case | Recommended | $ in / out (per 1M tokens) | Why |
|---|---|---|---|
| Best overall translation quality | Claude Opus 4.1 | $15 / $75 | Idiomatic, tone-preserving, best on literary & legal prose. |
| Frontier quality + multimodal | GPT-5 | $1.25 / $10 | Strong on every major pair, native voice + image input. |
| Long documents (200k–2M tokens) | Gemini 2.5 Pro | $1.25 / $10 | 2M context — translate a full novel in one call, glossary stays consistent. |
| High-volume cheap translation | Gemini 2.5 Flash | $0.30 / $2.50 | Frontier-family quality at flash-tier price — best $/quality for batch. |
| CJK (zh / ja / ko) source or target | Qwen 2.5 72B | $0.35 / $0.40 | Native CJK tokeniser — ~2× cheaper per character than Western models. |
| Cheapest open-weights option | DeepSeek V3 | $0.27 / $1.10 | Strong on EN↔CJK and EN↔major-EU, self-hostable. |
| European languages, EU data residency | Mistral Large 2 | $2 / $6 | French / German / Italian / Spanish first-class; EU-hosted. |

How translation quality differs from chat quality

Translation is the LLM task where benchmark scores correlate least with real-world output. A model with high MMLU-Pro can still produce stiff, literal translations, while a 32B open-weights model fine-tuned on parallel corpora can outperform it on the same pair. The dimensions that actually matter are the ones this guide is organised around: language pair, document-level consistency over long contexts, and per-character cost.

By language pair

English ↔ Chinese / Japanese / Korean (CJK)

Best value: Qwen 2.5 72B ($0.35/$0.40) and DeepSeek V3 ($0.27/$1.10). Both tokenise CJK natively, halving effective per-character cost vs GPT-5. Quality on EN↔ZH is at or above frontier-Western models for prose; slightly behind on rare technical terms.
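The tokeniser effect can be sketched in a few lines. The tokens-per-character ratios below are illustrative assumptions (not measured values, and they vary by corpus); the prices come from the table above. Measure your own text before relying on the numbers.

```python
# Rough per-character output cost for EN->ZH, showing why native CJK
# tokenisation matters. Tokens-per-character ratios here are assumptions --
# replace them with measurements from your own corpus.

PRICING_OUT = {          # $ per 1M output tokens (table above)
    "qwen-2.5-72b": 0.40,
    "deepseek-v3": 1.10,
    "gpt-5": 10.00,
}
TOKENS_PER_HAN_CHAR = {  # assumed average tokens per Chinese character
    "qwen-2.5-72b": 1.0,  # native CJK tokeniser: ~1 token per char
    "deepseek-v3": 1.0,
    "gpt-5": 1.7,         # Western-centric BPE often splits Han characters
}

def cost_per_million_chars(model: str) -> float:
    """Dollar cost to emit 1M Chinese characters with the given model."""
    tokens = 1_000_000 * TOKENS_PER_HAN_CHAR[model]
    return tokens / 1_000_000 * PRICING_OUT[model]

for m in PRICING_OUT:
    print(f"{m}: ${cost_per_million_chars(m):.2f} per 1M output chars")
```

Even before quality enters the picture, the tokeniser alone can swing per-character cost by an order of magnitude.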

Best quality: Claude Opus 4.1 for legal, medical, and literary translation where idiom and tone matter more than cost. GPT-5 is a close second and the right pick if you also need image OCR (translating Japanese signs, Chinese contracts in PDF).

For Chinese specifically, see the dedicated Best LLM for Chinese guide — it covers tokenisation and pricing in more depth.

English ↔ major European languages (FR, DE, ES, IT, PT)

Best quality: Claude Opus 4.1 and GPT-5 — both fluent and idiomatic in all five. Gendered grammar (German der/die/das, Romance noun agreement) is handled correctly far more often than mid-tier models.

Best EU-hosted option: Mistral Large 2. Trained with strong French / German / Italian / Spanish weighting; available with EU data residency for GDPR-sensitive workloads.

Best value: Gemini 2.5 Flash at $0.30/$2.50. For en↔fr or en↔de batch translation of marketing copy, support tickets, or product catalogues, the quality is indistinguishable from frontier 95%+ of the time at <5% the cost.

English ↔ low-resource languages (Vietnamese, Thai, Indonesian, Hindi, Swahili, etc.)

Best quality: Gemini 2.5 Pro. Google's training corpus has the broadest low-resource coverage, and Gemini consistently outperforms GPT-5 and Claude on Hindi, Bengali, Vietnamese, Thai, and most African languages on side-by-side evaluations.

Caveat: for any low-resource pair, run a 100-string side-by-side eval before committing. Quality varies wildly — a model that's great at en↔hi may be poor at en↔ta.
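A minimal blind-eval harness for that 100-string check might look like the sketch below. The `translate_a` / `translate_b` stubs are placeholders for your two candidate models (wire them to a real API client); the key detail is shuffling which model lands in which column, so raters judge blind.

```python
import random

# Blind A/B harness sketch for a side-by-side translation eval.
# translate_a / translate_b are stubs standing in for real model calls.

def translate_a(text: str) -> str:  # placeholder: call model A here
    return f"[model-A] {text}"

def translate_b(text: str) -> str:  # placeholder: call model B here
    return f"[model-B] {text}"

def build_blind_pairs(strings, seed=0):
    """Return (source, left, right, answer_key) rows with the two models'
    outputs randomly assigned to columns, so raters can't tell which is which."""
    rng = random.Random(seed)
    rows = []
    for s in strings:
        a, b = translate_a(s), translate_b(s)
        if rng.random() < 0.5:
            rows.append((s, a, b, "A-left"))
        else:
            rows.append((s, b, a, "B-left"))
    return rows

pairs = build_blind_pairs(["Xin chào", "Cảm ơn bạn"])
```

Collect rater picks per row, then unblind with the answer key to tally wins per model.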

Multilingual within one document (code-switched, mixed)

Best: Claude Opus 4.1 and GPT-5. Both detect and preserve code-switching in input (e.g., a Hindi-English transcript) without flattening it into one language. Smaller models tend to over-translate.

Long document translation

The biggest practical difference between frontier models in 2026 is context window. A 2M-token model can translate a full novel (≈200k words = ~270k tokens) plus a 50k-token glossary plus 4 reference translations — all in one prompt, with full glossary consistency end-to-end. Chunked translation always breaks on cross-chapter terminology.

| Model | Context (tokens) | $ in / out (per 1M tokens) | Best for |
|---|---|---|---|
| Gemini 2.5 Pro | 2,000,000 | $1.25 / $10 | Full novels, full codebases, multi-document research papers. |
| GPT-4.1 | 1,000,000 | $2 / $8 | Long technical docs; OpenAI ecosystem integration. |
| GPT-5 | 400,000 | $1.25 / $10 | Mid-length docs (book chapters, long contracts). |
| Claude Opus 4.1 | 200,000 | $15 / $75 | ~150k-token docs where prose polish > cost. |
| DeepSeek R1 | 128,000 | $0.55 / $2.19 | Long-doc translation that needs step-by-step reasoning. |
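The single-call claim is easy to sanity-check with back-of-envelope arithmetic. The novel and glossary token counts are the article's own estimates; treating each reference translation as novel-length and reserving 300k tokens for output are assumptions you should adjust.

```python
# Does novel + glossary + reference translations fit in one context window?
# Token counts follow the article's estimates; refs-as-novel-length and the
# output reserve are assumptions.

CONTEXT = {
    "gemini-2.5-pro": 2_000_000,
    "gpt-4.1": 1_000_000,
    "gpt-5": 400_000,
    "claude-opus-4.1": 200_000,
}

def fits(model, novel=270_000, glossary=50_000, refs=4 * 270_000,
         output_reserve=300_000):
    """True if the full prompt plus reserved output tokens fits the window."""
    return novel + glossary + refs + output_reserve <= CONTEXT[model]

# 270k + 50k + 1,080k + 300k = 1.7M tokens: only a 2M window holds it all.
```

Under these assumptions only the 2M-token model takes the whole job in one request; everything else forces chunking.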

Cost calculator: 1M characters / day translation workload

A typical translation pipeline ingests 1M source characters and emits ~1.2M output characters per day (output expansion varies by language pair — ZH→EN is ~1.5×, EN→DE is ~1.2×, EN→ES is ~1.1×). Effective per-character costs assume average tokenisation ratios:

| Model | Daily cost | Monthly cost | Yearly cost |
|---|---|---|---|
| Qwen 2.5 72B (CJK pairs) | $0.45 | $14 | $164 |
| DeepSeek V3 | $1.10 | $33 | $402 |
| GPT-4o mini | $1.05 | $32 | $385 |
| Gemini 2.5 Flash | $3.45 | $104 | $1,259 |
| Mistral Large 2 | $10.80 | $324 | $3,942 |
| GPT-5 | $13.75 | $413 | $5,019 |
| Claude Opus 4.1 | $108 | $3,240 | $39,420 |

Assumes average output token ratios per language pair. To plug in your exact numbers, use the interactive API cost calculator — it accepts custom token counts so you can model your real tokenisation overhead.
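The underlying arithmetic is simple enough to run yourself. The tokens-per-character ratio below is an assumption (roughly 0.25 for English prose, closer to 1.0 for CJK); small differences from the table come from the per-model ratios it bakes in.

```python
# Daily-cost sketch for a character-denominated translation workload.
# tok_per_char is an assumption -- measure it on your own corpus.

def daily_cost(in_chars, out_chars, price_in, price_out, tok_per_char=1.0):
    """price_in / price_out are $ per 1M tokens."""
    in_tok = in_chars * tok_per_char
    out_tok = out_chars * tok_per_char
    return (in_tok * price_in + out_tok * price_out) / 1_000_000

# 1M chars in, 1.2M chars out on Gemini 2.5 Flash ($0.30 / $2.50),
# assuming ~1 token per character:
flash = daily_cost(1_000_000, 1_200_000, 0.30, 2.50)
print(f"${flash:.2f}/day, ${flash * 30:.0f}/month")
```

Note that output price dominates: at a 1.2× expansion ratio, output tokens account for roughly 90% of the Flash bill.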

Run a multi-model translation A/B with one API key.

OpenRouter exposes Claude Opus, GPT-5, Gemini 2.5 Pro / Flash, DeepSeek V3, Qwen 2.5, Mistral Large and 100+ others behind one key — same per-token price as direct, with automatic provider failover. The fastest way to A/B test 4 models on the same source corpus. Get an OpenRouter key → (affiliate)

LLMs vs DeepL / Google Translate

The honest answer in 2026: it depends on the task.

The real change since 2024 is that GPT-4o mini and Gemini Flash are now cheaper per character than DeepL Pro at comparable quality, breaking DeepL's price advantage on bulk workloads.

The verdict

Pick by language family and budget: the TL;DR table at the top of this guide maps each use case to a recommended model and its per-token price.

Run a 100-string A/B on real source content before committing to one model — translation quality varies more by language pair than benchmark scores predict. OpenRouter exposes all of the above on a single key, which is the fastest way to A/B test before you wire up production.

Frequently asked questions

Is Claude better than GPT-5 for translation?

For literary, legal, and idiom-heavy prose, Claude Opus 4.1 tends to win on side-by-side blind tests. For technical documentation, code-mixed content, and multimodal inputs (translating an image of a sign), GPT-5 is stronger. Both are far ahead of mid-tier models — pick on price or on the multimodal needs of your pipeline.

Are LLMs better than DeepL or Google Translate?

For short, single-sentence translations between major European pairs, DeepL is still excellent and cheap. LLMs win on long documents, idioms, tone preservation, low-resource languages, and any translation where context or instructions matter. See the section above for a full comparison.

Which LLM is cheapest for translation?

For Latin-alphabet languages, Gemini 2.5 Flash ($0.30/$2.50) and GPT-4o mini ($0.15/$0.60). For CJK, Qwen 2.5 72B ($0.35/$0.40) and DeepSeek V3 ($0.27/$1.10) win on per-character cost because of native tokenisation.

What is the best LLM for translating long documents?

Gemini 2.5 Pro at 2M-token context is the only model that can hold a full novel or long technical manual in one request, and it preserves cross-chapter terminology consistency that chunked translation breaks. GPT-4.1 at 1M tokens is the strongest second pick.

Can I self-host an LLM for translation?

Yes — Qwen 2.5 72B, DeepSeek V3, and Llama 3.3 70B all have open weights. Qwen is the strongest pick for CJK; DeepSeek for general bilingual work; Llama 3.3 for Latin-alphabet languages with an English-to-X bias. A single H100 can serve Qwen 2.5 72B at production speed with a quantised checkpoint (the fp16 weights alone are ~144GB, more than one 80GB card holds); DeepSeek V3 needs a multi-GPU node, since it's a 671B-parameter MoE.
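A self-hosted setup might look like the launch fragment below, assuming vLLM as the serving stack and Qwen's AWQ-quantised checkpoint (both assumptions; swap in your own engine and model path).

```shell
# Sketch: serve Qwen 2.5 72B for translation with vLLM (assumed stack).
# The AWQ checkpoint fits a single 80GB H100; unquantised fp16 weights do not.
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --max-model-len 32768 \
  --tensor-parallel-size 1
```

vLLM exposes an OpenAI-compatible endpoint (by default at http://localhost:8000/v1), so an existing translation pipeline can switch between the hosted and self-hosted model by changing the base URL.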


Related: Best LLM for Chinese · Best cheap LLM API · Best open-source LLM · Best LLM for RAG

Methodology and sources: see About. Spotted a mistake? Open an issue.
