GPT-5 vs Claude Opus 4.1
Two models at the absolute frontier in 2026. Benchmarks say they're tied. Price, context window, and your specific use case break the tie.
OpenRouter routes GPT-5, Claude, Gemini, DeepSeek, Llama, Qwen and 100+ other LLMs behind a single key — pay-as-you-go, no monthly minimum, transparent per-token pricing. Try OpenRouter → (affiliate · supports this site)
One-sentence verdict
GPT-5 wins on price and raw math; Claude Opus 4.1 wins on agentic coding and long-form writing — but for most teams, neither flagship is the right pick. Use Claude Sonnet 4 ($3 / $15) or GPT-5 mini ($0.25 / $2) for 95% of the same quality at 5–20× lower cost.
The numbers, side-by-side
| Metric | GPT-5 | Claude Opus 4.1 | Δ (GPT-5 − Claude) |
|---|---|---|---|
| Composite (0–100) | 89.7 | 88.6 | +1.1 |
| Chatbot Arena Elo | 1410 | 1390 | +20 |
| MMLU-Pro | 86.8 | 87.0 | −0.2 |
| GPQA Diamond | 87.3 | 79.6 | +7.7 |
| MATH | 96.7 | 95.0 | +1.7 |
| HumanEval | 95.1 | 95.4 | −0.3 |
| SWE-Bench Verified | 74.9 | 74.5 | +0.4 |
| Price · input ($/1M) | $1.25 | $15.00 | −$13.75 |
| Price · output ($/1M) | $10.00 | $75.00 | −$65.00 |
| Context window | 400k | 200k | +200k |
| Output cap | 128k | 32k | +96k |
| Modalities | text, image, audio | text, image | |
| Released | 2025-08 | 2025-08 | |
Numbers compiled from provider technical reports and Chatbot Arena snapshots. See methodology.
OpenRouter exposes GPT-5, Claude Opus 4.1, and 100+ other models behind a single API and a single invoice. Try OpenRouter → (affiliate)
Where GPT-5 wins
- Price. 12× cheaper on input, 7.5× cheaper on output. At any non-trivial scale this dominates every other consideration.
- GPQA Diamond (+7.7). The largest single-benchmark gap. GPT-5 is meaningfully stronger on graduate-level science reasoning.
- MATH (+1.7). Marginal but consistent — GPT-5 is the stronger raw mathematician.
- Context window. 400k vs 200k tokens — twice the room for codebases, books, or long-running agent state.
- Native audio modality. If your application processes voice, GPT-5 ingests audio natively where Claude needs a separate transcription step.
Where Claude Opus 4.1 wins
- HumanEval (+0.3) and qualitative coding feel. The gap on synthetic benchmarks is tiny, but Anthropic's own write-ups and developer reports on multi-day agentic tasks consistently rate Claude as the stronger long-horizon coder. Anecdotal, but consistent.
- MMLU-Pro (+0.2). Statistical noise, but Claude is at minimum tied on broad academic knowledge.
- Writing & editorial tone. Not in this table because we don't benchmark prose, but public preference comparisons (style-controlled Arena rankings, companies' internal bake-offs) consistently put Claude ahead on long-form writing.
- Refusal calibration. Claude is famously less prone to over-refusal on technical questions involving anything sensitive (security, dual-use, etc.).
Picking by use case
| Use case | Pick | Why |
|---|---|---|
| Production coding agent (autonomous) | Claude Opus 4.1 if budget allows, else Claude Sonnet 4 | Long-horizon multi-file edits, fewer regressions over thousands of token-turns. |
| Daily IDE pair programmer | Claude Sonnet 4 | 72.7% SWE-Bench at $3 / $15. Opus is overkill for line-by-line editing. |
| High-volume API backend | GPT-5 mini | $0.25 / $2, 60.5% SWE-Bench. Budget-friendly, more than capable. |
| Math / quant research | GPT-5 | +1.7 on MATH, +7.7 on GPQA. The clearer reasoner. |
| Customer-facing chatbot (English-first) | Claude Sonnet 4 | Best refusal calibration, best tone, half the price of Opus. |
| Voice / speech application | GPT-5 | Native audio modality — no transcription step. |
| Long-context retrieval (>200k tokens) | GPT-5 (400k) or Gemini 2.5 Pro (2M) | Claude's 200k cap is the binding constraint. |
| Mixed-team default | Both, via OpenRouter | One key, one invoice — let each engineer pick. |
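In code, the mixed-team default can be as simple as a routing table keyed by task type, resolved through one OpenRouter-style gateway. A minimal sketch; the model slugs below are illustrative placeholders, so substitute the exact IDs from your provider's catalog:

```python
# Route each task type to the model the table above recommends.
# Slugs are illustrative placeholders, not confirmed provider IDs.
MODEL_BY_TASK = {
    "coding_agent":  "anthropic/claude-opus-4.1",
    "ide_assistant": "anthropic/claude-sonnet-4",
    "bulk_backend":  "openai/gpt-5-mini",
    "math_research": "openai/gpt-5",
    "chatbot":       "anthropic/claude-sonnet-4",
    "voice":         "openai/gpt-5",
}

def pick_model(task: str) -> str:
    # Fall back to the cheap generalist for unknown task types.
    return MODEL_BY_TASK.get(task, "openai/gpt-5-mini")
```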
The cost reality check
For a 10M-token-per-day production workload (~5M in, ~5M out — typical for a moderately busy chatbot), the daily API bill is:
- GPT-5: 5 × $1.25 + 5 × $10.00 = $56.25 / day
- Claude Opus 4.1: 5 × $15 + 5 × $75 = $450 / day
- Claude Sonnet 4: 5 × $3 + 5 × $15 = $90 / day
- GPT-5 mini: 5 × $0.25 + 5 × $2 = $11.25 / day
Claude Opus 4.1 costs roughly $144,000 more per year than GPT-5 at this volume ($393.75 a day × 365 days). The benchmarks are tied. You will need an extraordinarily strong qualitative reason to justify that gap.
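If your traffic mix differs, the arithmetic is trivial to reproduce. A minimal sketch with the per-1M-token prices hard-coded from the table above; plug in your own in/out split:

```python
# Daily API cost in USD for a given traffic mix, using the
# per-1M-token prices from the comparison table above.
PRICES = {  # model: (input $/1M, output $/1M)
    "gpt-5":           (1.25, 10.00),
    "claude-opus-4.1": (15.00, 75.00),
    "claude-sonnet-4": (3.00, 15.00),
    "gpt-5-mini":      (0.25, 2.00),
}

def daily_cost(model: str, in_tokens_m: float, out_tokens_m: float) -> float:
    """Cost for input/output volumes given in millions of tokens."""
    price_in, price_out = PRICES[model]
    return in_tokens_m * price_in + out_tokens_m * price_out

for model in PRICES:
    print(f"{model:16} ${daily_cost(model, 5, 5):7.2f} / day")
# gpt-5            $  56.25 / day
# claude-opus-4.1  $ 450.00 / day
```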
Honourable mention: the model nobody asks about
DeepSeek R1 (MIT-licensed, $0.55 / $2.19) scores 84.0 on MMLU-Pro and 97.3 on MATH — the highest MATH score on the leaderboard. It's open-weights, which neither GPT-5 nor Claude is. If your reason for picking between OpenAI and Anthropic is "I don't trust either with my data", DeepSeek R1 is the answer this comparison hides.
Frequently asked questions
Is GPT-5 better than Claude Opus 4.1?
On a composite of six public benchmarks, GPT-5 (89.7) edges Claude Opus 4.1 (88.6) by roughly 1 point — within the noise floor. GPT-5 is stronger on MATH and dramatically cheaper; Claude Opus 4.1 leads on HumanEval and is preferred for long-horizon agentic coding. The right answer depends on use case, not on a single number.
Which is cheaper, GPT-5 or Claude Opus 4.1?
GPT-5 is 12× cheaper on input and 7.5× cheaper on output. $1.25 / $10 per 1M tokens versus $15 / $75. For cost-equivalent quality from Anthropic, use Claude Sonnet 4 at $3 / $15.
Which has the bigger context window?
GPT-5 supports 400,000 tokens; Claude Opus 4.1 supports 200,000. If you regularly feed the model an entire codebase, GPT-5 has a clear advantage.
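To check whether a payload actually fits, count tokens before you send. A rough sketch using tiktoken; note that cl100k_base is an older OpenAI encoding and not either model's production tokenizer, so treat the count as an estimate (and the file path as a stand-in for your own corpus):

```python
# Rough pre-flight check: will this corpus fit in the context window?
# Requires `pip install tiktoken`. cl100k_base only approximates the
# real tokenizers, so leave yourself headroom.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits(text: str, window: int) -> bool:
    return len(enc.encode(text)) <= window

corpus = open("whole_codebase.txt").read()  # stand-in for your payload
print("GPT-5 (400k):   ", fits(corpus, 400_000))
print("Opus 4.1 (200k):", fits(corpus, 200_000))
```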
Which is better for coding?
Statistically tied on SWE-Bench Verified (74.9% vs 74.5%). Claude has a slight qualitative edge on multi-file refactors; GPT-5 is more consistent on first-shot patches. For most teams, the cheaper Claude Sonnet 4 (72.7%, $3 / $15) is the practical choice over Opus.
Should I use the OpenAI / Anthropic API directly, or a router?
If you're committed to one vendor and have a contract, direct is fine. If you want to A/B-test or hedge, OpenRouter exposes both behind one API at the same per-token price (it earns its margin from volume, not markup).
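For reference, OpenRouter speaks the OpenAI-compatible chat completions API, so an A/B test is a one-line model change. A minimal sketch using the official openai Python SDK; the model slugs are illustrative, so confirm the exact IDs in OpenRouter's model catalog:

```python
# A/B the two flagships through one endpoint. Requires `pip install openai`
# and an OpenRouter API key. Model slugs below are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

for model in ("openai/gpt-5", "anthropic/claude-opus-4.1"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Refactor this function..."}],
    )
    print(model, "->", resp.choices[0].message.content[:80])
```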
Related: Best LLM for coding (2026) · GPT-5 vs Gemini 2.5 Pro · Claude Opus 4.1 vs Gemini 2.5 Pro · DeepSeek R1 vs GPT-5
Methodology and sources: see About. Spotted a number that's out of date? Open an issue.