The best LLM for writing in 2026
Long-form essays, marketing copy, fiction, technical docs — every frontier model ranked honestly on prose quality, voice consistency, refusal rate, and price.
OpenRouter routes Claude, GPT-5, Gemini, DeepSeek, Mistral, Llama and 100+ other LLMs behind a single key — pay-as-you-go, no monthly minimum, no markup over provider pricing. Try OpenRouter → (affiliate · supports this site)
TL;DR — pick by use case
| Use case | Best pick | Strength | Price in / out (per 1M tokens) |
|---|---|---|---|
| Long-form essays / books | Claude Opus 4.1 | Voice + 200k ctx | $15 / $75 |
| Daily blog & copywriting | Claude Sonnet 4 | Best $/quality | $3 / $15 |
| High-volume content / SEO bots | GPT-5 mini · Gemini 2.5 Flash | Throughput | $0.25 / $2 · $0.30 / $2.50 |
| Whole-manuscript editing | Gemini 2.5 Pro | 2M context | $1.25 / $10 |
| Cheap workhorse | DeepSeek V3 | Near-frontier | $0.27 / $1.10 |
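The per-1M-token prices above translate directly into a monthly bill. A minimal sketch of the arithmetic; the Sonnet prices come from the table, while the workload numbers (requests and token counts) are an invented example:

```python
def monthly_cost(requests: int, in_tokens: int, out_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Estimate monthly spend in USD given per-1M-token prices."""
    total_in = requests * in_tokens
    total_out = requests * out_tokens
    return (total_in / 1e6) * price_in + (total_out / 1e6) * price_out

# 1,000 blog drafts/month on Claude Sonnet 4 ($3 in / $15 out per 1M),
# assuming a 2k-token prompt and a 1.5k-token completion per draft:
cost = monthly_cost(1_000, 2_000, 1_500, 3.00, 15.00)
print(f"${cost:.2f}")  # → $28.50
```

Note that output tokens dominate the bill for most writing workloads, which is why the out-price column matters more than the in-price for drafting (as opposed to editing) use cases.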
Spin up Claude, GPT-5, Gemini and DeepSeek with the same prompt — one OpenRouter key, no separate signups with each provider. Try OpenRouter → (affiliate)
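In practice "one key, many models" means one OpenAI-style chat-completions endpoint where only the model slug changes per request. A standard-library sketch; the model slugs and the key placeholder are illustrative assumptions, so check OpenRouter's model catalogue for current names:

```python
import json
from urllib.request import Request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key: str, model: str, prompt: str) -> Request:
    """One OpenAI-style chat payload; only `model` changes per provider."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Same prompt, four providers, one key (slugs illustrative):
MODELS = [
    "anthropic/claude-sonnet-4",
    "openai/gpt-5",
    "google/gemini-2.5-pro",
    "deepseek/deepseek-chat",
]
prompt = "Rewrite this paragraph in a warmer, more conversational voice: ..."
reqs = [build_request("sk-or-...", m, prompt) for m in MODELS]
# each Request can then be sent with urllib.request.urlopen(req)
```

Side-by-side comparisons like this are the cheapest way to test the voice claims below against your own prompts before committing to one model.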
How we rank writing ability
There is no single SWE-Bench equivalent for prose, so we triangulate four signals:
- Chatbot Arena writing slice (heavy weight). Hundreds of thousands of blind A/B votes from real users on writing-shaped prompts. Currently the best public proxy for "which model produces prose people prefer".
- EQ-Bench Creative Writing v3 (medium weight). Rubric-graded long-form responses across 24 fiction and essay tasks.
- Refusal rate on benign-but-edgy prompts (medium weight). A model that refuses to write a villain monologue is functionally useless for novelists.
- Voice consistency over 4k+ tokens (heavy weight). Hand-graded from internal long-form runs.
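The four-signal blend above amounts to a weighted average over normalized scores. A sketch of the idea only; the numeric weights (heavy = 2, medium = 1) and the example scores are invented for illustration, not the site's real figures:

```python
def writing_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of 0-100 signals. Refusal is assumed to be inverted
    upstream, so higher always means better (refuses less)."""
    total_weight = sum(weights.values())
    return sum(signals[k] * weights[k] for k in weights) / total_weight

# heavy = 2 (Arena writing slice, voice consistency), medium = 1 (EQ-Bench, refusal)
weights = {"arena_writing": 2, "eq_bench": 1, "refusal": 1, "voice": 2}
signals = {"arena_writing": 90, "eq_bench": 82, "refusal": 75, "voice": 95}
print(round(writing_score(signals, weights), 1))  # → 87.8
```

The point of the weighting is that a model with stellar Arena votes but poor voice consistency still ranks below one that is merely good at both.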
Frontier tier — for serious long-form work
- Claude Opus 4.1 — the writers' favourite. Cleanest first drafts, strongest style-guide adherence, lowest tendency to drift into "AI voice" tics (the em-dash plague, the rule-of-three obsession). Refuses less than GPT-5 on adult/edgy content. $15 in / $75 out per 1M, 200k context. The right tool for novel chapters, op-eds, and high-stakes copy.
- GPT-5 — more linguistically flexible than Claude, better at code-switching between formal and casual registers, stronger on factual grounding. Punchier sentences. Slightly higher refusal rate on creative content. $1.25 / $10 per 1M, 400k context — significantly cheaper than Opus, which makes it the default for non-fiction and journalism.
- Claude Sonnet 4 — Opus's voice DNA at 1/5 the price. The right default for daily blogging, newsletters, and most marketing copy. $3 / $15. If you can only run one writing model in production, this is it.
- Gemini 2.5 Pro — distinctive: 2M-token context lets it edit a 1.5-million-word manuscript in one shot, or maintain consistency across a 100-chapter series bible. Prose itself is solid but a step behind Claude on voice. $1.25 / $10.
- Grok 4 — lower refusal rate than competitors, stronger contemporary cultural references. Useful for satire and current-events commentary. $3 / $15.
Mid tier — for high-volume content
- GPT-5 mini — 80% of GPT-5's writing quality at 1/5 the price ($0.25 / $2). The right pick for SEO content farms, ecom product descriptions, and customer-email auto-responders.
- Claude 3.5 Haiku — Anthropic's cheap fast model. $0.80 / $4. Solid voice, strong instruction-following on tone shifts.
- Gemini 2.5 Flash — $0.30 / $2.50. Best $/quality in the tier. Long context (1M) carries over from Pro.
Open weights — for self-hosting and EU data residency
- DeepSeek V3 (MIT licence) — surprisingly strong English prose. The cheapest path to near-frontier writing quality at $0.27 / $1.10 per 1M tokens on the official API. Also the strongest open model for Chinese, Japanese, Korean.
- Mistral Large 2 (research licence) — French/EU-jurisdiction option for organizations with data-residency requirements. Strong on European languages.
- Llama 3.3 70B (Llama community licence) — the right open default if you want a model that fits on a single H100. Voice is more "neutral newswire" than literary, but reliable.
Voice and refusal — the hidden ranking
Headline benchmarks miss the two factors that matter most to working writers:
- Voice consistency. Claude wins by a clear margin on staying in character over 4k+ tokens. GPT-5 drifts faster but recovers better when prompted. Gemini and Grok drift the most.
- Refusal rate on benign-but-edgy prompts. If you write fiction with morally complex characters, satirical commentary, or anything involving violence, romance, or politics — most models will refuse or hedge. Current ranking from least- to most-refusing: Grok 4 → Claude Opus 4.1 → DeepSeek V3 → Gemini 2.5 Pro → GPT-5 → Claude Sonnet 4.
Frequently asked questions
What's the best LLM for writing in 2026?
For long-form prose with consistent voice, Claude Opus 4.1 is the consensus top pick. For most writers the better economic choice is Claude Sonnet 4 at $3 / $15 — same voice DNA, one-fifth the price.
Is GPT-5 or Claude better for creative writing?
Claude produces more emotionally consistent prose and follows style guides more reliably. GPT-5 is more flexible and writes punchier, more varied sentences. Most fiction writers prefer Claude; most journalists and marketers prefer GPT-5.
What's the cheapest LLM that writes well?
DeepSeek V3 at $0.27 / $1.10 per 1M tokens delivers near-frontier English at roughly 1/50th the price of Claude Opus 4.1. For Chinese / Japanese / Korean it is currently the strongest cheap option.
Which LLM has the longest context for editing whole manuscripts?
Gemini 2.5 Pro at 2,000,000 tokens — roughly a 1.5-million-word book in a single prompt. GPT-5 has 400k, Claude Opus 4.1 has 200k.
Are there models with lower refusal rates for fiction?
Grok 4 has the lowest refusal rate among frontier models on benign-but-edgy fiction prompts. Claude Opus 4.1 is the next-most-permissive at frontier quality. Open-weights models you can self-host (DeepSeek V3, Llama 3.3) have effectively no refusals once you set the system prompt.
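"Set the system prompt" here just means pinning the model's role before the user turn. A sketch of the message structure for a self-hosted model behind an OpenAI-compatible server (vLLM and llama.cpp both expose one); the endpoint, model name, and prompt wording are illustrative assumptions:

```python
SYSTEM_PROMPT = (
    "You are a fiction-writing assistant. The user is a novelist; dark themes, "
    "morally complex characters, and villain dialogue are all in scope. "
    "Stay in the requested voice and do not moralize."
)

def fiction_messages(user_prompt: str) -> list[dict]:
    """Pin the fiction-writing role ahead of every user turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]

msgs = fiction_messages("Write the villain's monologue from chapter 12.")
# POST {"model": "deepseek-v3", "messages": msgs} to your local
# /v1/chat/completions endpoint (e.g. a vLLM or llama.cpp server)
```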
Methodology and sources: see About. Spotted a number that's out of date? Open an issue.