The best LLM for writing in 2026
Long-form essays, marketing copy, fiction, technical docs — every frontier model ranked honestly on prose quality, voice consistency, refusal rate, and price.
OpenRouter routes Claude, GPT-5, Gemini, DeepSeek, Mistral, Llama and 100+ other LLMs behind a single key — pay-as-you-go, no monthly minimum, no markup over provider pricing. Try OpenRouter → (affiliate · supports this site)
TL;DR — pick by use case
| Use case | Best pick | Strength | Price in / out (per 1M tokens) |
|---|---|---|---|
| Long-form essays / books | Claude Opus 4.1 | Voice + 200k ctx | $15 / $75 |
| Daily blog & copywriting | Claude Sonnet 4 | Best $/quality | $3 / $15 |
| High-volume content / SEO bots | GPT-5 mini · Gemini 2.5 Flash | Throughput | $0.25 / $2 · $0.30 / $2.50 |
| Whole-manuscript editing | Gemini 2.5 Pro | 2M context | $1.25 / $10 |
| Cheap workhorse | DeepSeek V3 | Near-frontier | $0.27 / $1.10 |
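The per-1M-token prices above translate directly into a monthly bill. A minimal sketch of the arithmetic; the Sonnet prices come from the table, while the workload numbers (requests and token counts) are an invented example:

```python
def monthly_cost(requests: int, in_tokens: int, out_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Estimate monthly spend in USD given per-1M-token prices."""
    total_in = requests * in_tokens
    total_out = requests * out_tokens
    return (total_in / 1e6) * price_in + (total_out / 1e6) * price_out

# 1,000 blog drafts/month on Claude Sonnet 4 ($3 in / $15 out per 1M),
# assuming a 2k-token prompt and a 1.5k-token completion per draft:
cost = monthly_cost(1_000, 2_000, 1_500, 3.00, 15.00)
print(f"${cost:.2f}")  # → $28.50
```

Note that output tokens dominate the bill for most writing workloads, which is why the out-price column matters more than the in-price for drafting (as opposed to editing) use cases.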
Spin up Claude, GPT-5, Gemini and DeepSeek with the same prompt — one OpenRouter key, no separate signups with each provider. Try OpenRouter → (affiliate)
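In practice "one key, many models" means one OpenAI-style chat-completions endpoint where only the model slug changes per request. A standard-library sketch; the model slugs and the key placeholder are illustrative assumptions, so check OpenRouter's model catalogue for current names:

```python
import json
from urllib.request import Request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key: str, model: str, prompt: str) -> Request:
    """One OpenAI-style chat payload; only `model` changes per provider."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Same prompt, four providers, one key (slugs illustrative):
MODELS = [
    "anthropic/claude-sonnet-4",
    "openai/gpt-5",
    "google/gemini-2.5-pro",
    "deepseek/deepseek-chat",
]
prompt = "Rewrite this paragraph in a warmer, more conversational voice: ..."
reqs = [build_request("sk-or-...", m, prompt) for m in MODELS]
# each Request can then be sent with urllib.request.urlopen(req)
```

Side-by-side comparisons like this are the cheapest way to test the voice claims below against your own prompts before committing to one model.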
How we rank writing ability
There is no single SWE-Bench equivalent for prose, so we triangulate four signals:
- Chatbot Arena writing slice (heavy weight). Hundreds of thousands of blind A/B votes from real users on writing-shaped prompts. Currently the best public proxy for "which model produces prose people prefer".
- EQ-Bench Creative Writing v3 (medium weight). Rubric-graded long-form responses across 24 fiction and essay tasks.
- Refusal rate on benign-but-edgy prompts (medium weight). A model that refuses to write a villain monologue is functionally useless for novelists.
- Voice consistency over 4k+ tokens (heavy weight). Hand-graded from internal long-form runs.
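The four-signal blend above amounts to a weighted average over normalized scores. A sketch of the idea only; the numeric weights (heavy = 2, medium = 1) and the example scores are invented for illustration, not the site's real figures:

```python
def writing_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of 0-100 signals. Refusal is assumed to be inverted
    upstream, so higher always means better (refuses less)."""
    total_weight = sum(weights.values())
    return sum(signals[k] * weights[k] for k in weights) / total_weight

# heavy = 2 (Arena writing slice, voice consistency), medium = 1 (EQ-Bench, refusal)
weights = {"arena_writing": 2, "eq_bench": 1, "refusal": 1, "voice": 2}
signals = {"arena_writing": 90, "eq_bench": 82, "refusal": 75, "voice": 95}
print(round(writing_score(signals, weights), 1))  # → 87.8
```

The point of the weighting is that a model with stellar Arena votes but poor voice consistency still ranks below one that is merely good at both.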
Frontier tier — for serious long-form work
- Claude Opus 4.1 — the writers' favourite. Cleanest first drafts, strongest style-guide adherence, lowest tendency to drift into "AI voice" tics (the em-dash plague, the rule-of-three obsession). Refuses less than GPT-5 on adult/edgy content. $15 in / $75 out per 1M, 200k context. The right tool for novel chapters, op-eds, and high-stakes copy.
- GPT-5 — more linguistically flexible than Claude, better at code-switching between formal and casual registers, stronger on factual grounding. Punchier sentences. Slightly higher refusal rate on creative content. $1.25 / $10 per 1M, 400k context — significantly cheaper than Opus, which makes it the default for non-fiction and journalism.
- Claude Sonnet 4 — Opus's voice DNA at 1/5 the price. The right default for daily blogging, newsletters, and most marketing copy. $3 / $15. If you can only run one writing model in production, this is it.
- Gemini 2.5 Pro — distinctive: 2M-token context lets it edit a 1.5-million-word manuscript in one shot, or maintain consistency across a 100-chapter series bible. Prose itself is solid but a step behind Claude on voice. $1.25 / $10.
- Grok 4 — lower refusal rate than competitors, stronger contemporary cultural references. Useful for satire and current-events commentary. $3 / $15.
Mid tier — for high-volume content
- GPT-5 mini — 80% of GPT-5's writing quality at 1/5 the price ($0.25 / $2). The right pick for SEO content farms, ecom product descriptions, and customer-email auto-responders.
- Claude 3.5 Haiku — Anthropic's cheap fast model. $0.80 / $4. Solid voice, strong instruction-following on tone shifts.
- Gemini 2.5 Flash — $0.30 / $2.50. Best $/quality in the tier. Long context (1M) carries over from Pro.
Open weights — for self-hosting and EU data residency
- DeepSeek V3 (MIT licence) — surprisingly strong English prose. The cheapest path to near-frontier writing quality at $0.27 / $1.10 per 1M tokens on the official API. Also the strongest open model for Chinese, Japanese, Korean.
- Mistral Large 2 (research licence) — French/EU-jurisdiction option for organizations with data-residency requirements. Strong on European languages.
- Llama 3.3 70B (Llama community licence) — the right open default if you want a model that fits on a single H100. Voice is more "neutral newswire" than literary, but reliable.
Voice and refusal — the hidden ranking
Headline benchmarks miss the two factors that matter most to working writers:
- Voice consistency. Claude wins by a clear margin on staying in character over 4k+ tokens. GPT-5 drifts faster but recovers better when prompted. Gemini and Grok drift the most.
- Refusal rate on benign-but-edgy prompts. If you write fiction with morally complex characters, satirical commentary, or anything involving violence, romance, or politics — most models will refuse or hedge. Current ranking from least- to most-refusing: Grok 4 → Claude Opus 4.1 → DeepSeek V3 → Gemini 2.5 Pro → GPT-5 → Claude Sonnet 4.
Frequently asked questions
What's the best LLM for writing in 2026?
For long-form prose with consistent voice, Claude Opus 4.1 is the consensus top pick. For most writers the better economic choice is Claude Sonnet 4 at $3 / $15 — same voice DNA, one-fifth the price.
Is GPT-5 or Claude better for creative writing?
Claude produces more emotionally consistent prose and follows style guides more reliably. GPT-5 is more flexible and writes punchier, more varied sentences. Most fiction writers prefer Claude; most journalists and marketers prefer GPT-5.
What's the cheapest LLM that writes well?
DeepSeek V3 at $0.27 / $1.10 per 1M tokens delivers near-frontier English at roughly 1/50th the price of Claude Opus 4.1. For Chinese / Japanese / Korean it is currently the strongest cheap option.
Which LLM has the longest context for editing whole manuscripts?
Gemini 2.5 Pro at 2,000,000 tokens — roughly a 1.5-million-word book in a single prompt. GPT-5 has 400k, Claude Opus 4.1 has 200k.
Are there models with lower refusal rates for fiction?
Grok 4 has the lowest refusal rate among frontier models on benign-but-edgy fiction prompts. Claude Opus 4.1 is the next-most-permissive at frontier quality. Open-weights models you can self-host (DeepSeek V3, Llama 3.3) have effectively no refusals once you set the system prompt.
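"Set the system prompt" here just means pinning the model's role before the user turn. A sketch of the message structure for a self-hosted model behind an OpenAI-compatible server (vLLM and llama.cpp both expose one); the endpoint, model name, and prompt wording are illustrative assumptions:

```python
SYSTEM_PROMPT = (
    "You are a fiction-writing assistant. The user is a novelist; dark themes, "
    "morally complex characters, and villain dialogue are all in scope. "
    "Stay in the requested voice and do not moralize."
)

def fiction_messages(user_prompt: str) -> list[dict]:
    """Pin the fiction-writing role ahead of every user turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]

msgs = fiction_messages("Write the villain's monologue from chapter 12.")
# POST {"model": "deepseek-v3", "messages": msgs} to your local
# /v1/chat/completions endpoint (e.g. a vLLM or llama.cpp server)
```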
Methodology and sources: see About. Spotted a number that's out of date? Open an issue.