GPT-5 vs Claude Opus 4.1
Two models at the absolute frontier in 2026. Benchmarks say they're tied. Price, context window, and your specific use case break the tie.
OpenRouter routes GPT-5, Claude, Gemini, DeepSeek, Llama, Qwen and 100+ other LLMs behind a single key — pay-as-you-go, no monthly minimum, transparent per-token pricing. Try OpenRouter → (affiliate · supports this site)
One-sentence verdict
GPT-5 wins on price and raw math; Claude Opus 4.1 wins on agentic coding and long-form writing — but for most teams, neither flagship is the right pick. Use Claude Sonnet 4 ($3 / $15) or GPT-5 mini ($0.25 / $2) for 95% of the same quality at 5–20× lower cost.
The numbers, side-by-side
| Metric | GPT-5 | Claude Opus 4.1 | Δ (GPT-5 − Claude) |
|---|---|---|---|
| Composite (0–100) | 89.7 | 88.6 | +1.1 |
| Chatbot Arena Elo | 1410 | 1390 | +20 |
| MMLU-Pro | 86.8 | 87.0 | −0.2 |
| GPQA Diamond | 87.3 | 79.6 | +7.7 |
| MATH | 96.7 | 95.0 | +1.7 |
| HumanEval | 95.1 | 95.4 | −0.3 |
| SWE-Bench Verified | 74.9 | 74.5 | +0.4 |
| Price · input ($/1M) | $1.25 | $15.00 | −$13.75 |
| Price · output ($/1M) | $10.00 | $75.00 | −$65.00 |
| Context window | 400k | 200k | +200k |
| Output cap | 128k | 32k | +96k |
| Modalities | text, image, audio | text, image | |
| Released | 2025-08 | 2025-08 | |
Numbers compiled from provider technical reports and Chatbot Arena snapshots. See methodology.
OpenRouter exposes GPT-5, Claude Opus 4.1, and 100+ other models behind a single API and a single invoice. Try OpenRouter → (affiliate)
Where GPT-5 wins
- Price. 12× cheaper on input, 7.5× cheaper on output. At any non-trivial scale this dominates every other consideration.
- GPQA Diamond (+7.7). The largest single-benchmark gap. GPT-5 is meaningfully stronger on graduate-level science reasoning.
- MATH (+1.7). Marginal but consistent — GPT-5 is the stronger raw mathematician.
- Context window. 400k vs 200k tokens — twice the room for codebases, books, or long-running agent state.
- Native audio modality. If your application processes voice, GPT-5 ingests audio natively where Claude needs a separate transcription step.
Where Claude Opus 4.1 wins
- HumanEval (+0.3) and qualitative coding feel. The gap on synthetic benchmarks is tiny, but Anthropic's own write-ups and developer reports on multi-day agentic tasks consistently rate Claude as the stronger long-horizon coder. Anecdotal, but consistent.
- MMLU-Pro (+0.2). Statistical noise, but Claude is at minimum tied on broad academic knowledge.
- Writing & editorial tone. Not in this table because we don't benchmark prose, but public preference comparisons (style-controlled Arena rankings, companies' internal bake-offs) consistently put Claude ahead on long-form writing.
- Refusal calibration. Claude is famously less prone to over-refusal on technical questions involving anything sensitive (security, dual-use, etc.).
Picking by use case
| Use case | Pick | Why |
|---|---|---|
| Production coding agent (autonomous) | Claude Opus 4.1 if budget allows, else Claude Sonnet 4 | Long-horizon multi-file edits, fewer regressions over thousands of token-turns. |
| Daily IDE pair programmer | Claude Sonnet 4 | 72.7% SWE-Bench at $3 / $15. Opus is overkill for line-by-line editing. |
| High-volume API backend | GPT-5 mini | $0.25 / $2, 60.5% SWE-Bench. Budget-friendly, more than capable. |
| Math / quant research | GPT-5 | +1.7 on MATH, +7.7 on GPQA. The clearer reasoner. |
| Customer-facing chatbot (English-first) | Claude Sonnet 4 | Best refusal calibration, best tone, half the price of Opus. |
| Voice / speech application | GPT-5 | Native audio modality — no transcription step. |
| Long-context retrieval (>200k tokens) | GPT-5 (400k) or Gemini 2.5 Pro (2M) | Claude's 200k cap is the binding constraint. |
| Mixed-team default | Both, via OpenRouter | One key, one invoice — let each engineer pick. |
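In code, the mixed-team default can be as simple as a routing table keyed by task type, resolved through one OpenRouter-style gateway. A minimal sketch; the model slugs below are illustrative placeholders, so substitute the exact IDs from your provider's catalog:

```python
# Route each task type to the model the table above recommends.
# Slugs are illustrative placeholders, not confirmed provider IDs.
MODEL_BY_TASK = {
    "coding_agent":  "anthropic/claude-opus-4.1",
    "ide_assistant": "anthropic/claude-sonnet-4",
    "bulk_backend":  "openai/gpt-5-mini",
    "math_research": "openai/gpt-5",
    "chatbot":       "anthropic/claude-sonnet-4",
    "voice":         "openai/gpt-5",
}

def pick_model(task: str) -> str:
    # Fall back to the cheap generalist for unknown task types.
    return MODEL_BY_TASK.get(task, "openai/gpt-5-mini")
```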
The cost reality check
For a 10M-token-per-day production workload (~5M in, ~5M out — typical for a moderately busy chatbot), the daily API bill is:
- GPT-5: 5 × $1.25 + 5 × $10.00 = $56.25 / day
- Claude Opus 4.1: 5 × $15 + 5 × $75 = $450 / day
- Claude Sonnet 4: 5 × $3 + 5 × $15 = $90 / day
- GPT-5 mini: 5 × $0.25 + 5 × $2 = $11.25 / day
Claude Opus 4.1 costs roughly $144,000 more per year than GPT-5 at this volume ($393.75 a day × 365 days). The benchmarks are tied. You will need an extraordinarily strong qualitative reason to justify that gap.
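If your traffic mix differs, the arithmetic is trivial to reproduce. A minimal sketch with the per-1M-token prices hard-coded from the table above; plug in your own in/out split:

```python
# Daily API cost in USD for a given traffic mix, using the
# per-1M-token prices from the comparison table above.
PRICES = {  # model: (input $/1M, output $/1M)
    "gpt-5":           (1.25, 10.00),
    "claude-opus-4.1": (15.00, 75.00),
    "claude-sonnet-4": (3.00, 15.00),
    "gpt-5-mini":      (0.25, 2.00),
}

def daily_cost(model: str, in_tokens_m: float, out_tokens_m: float) -> float:
    """Cost for input/output volumes given in millions of tokens."""
    price_in, price_out = PRICES[model]
    return in_tokens_m * price_in + out_tokens_m * price_out

for model in PRICES:
    print(f"{model:16} ${daily_cost(model, 5, 5):7.2f} / day")
# gpt-5            $  56.25 / day
# claude-opus-4.1  $ 450.00 / day
```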
Honourable mention: the model nobody asks about
DeepSeek R1 (MIT-licensed, $0.55 / $2.19) scores 84.0 on MMLU-Pro and 97.3 on MATH — the highest MATH score on the leaderboard. It's open-weights, which neither GPT-5 nor Claude is. If your reason for picking between OpenAI and Anthropic is "I don't trust either with my data", DeepSeek R1 is the answer this comparison hides.
Frequently asked questions
Is GPT-5 better than Claude Opus 4.1?
On a composite of six public benchmarks, GPT-5 (89.7) edges Claude Opus 4.1 (88.6) by roughly 1 point — within the noise floor. GPT-5 is stronger on MATH and dramatically cheaper; Claude Opus 4.1 leads on HumanEval and is preferred for long-horizon agentic coding. The right answer depends on use case, not on a single number.
Which is cheaper, GPT-5 or Claude Opus 4.1?
GPT-5 is 12× cheaper on input and 7.5× cheaper on output. $1.25 / $10 per 1M tokens versus $15 / $75. For cost-equivalent quality from Anthropic, use Claude Sonnet 4 at $3 / $15.
Which has the bigger context window?
GPT-5 supports 400,000 tokens; Claude Opus 4.1 supports 200,000. If you regularly feed the model an entire codebase, GPT-5 has a clear advantage.
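To check whether a payload actually fits, count tokens before you send. A rough sketch using tiktoken; note that cl100k_base is an older OpenAI encoding and not either model's production tokenizer, so treat the count as an estimate (and the file path as a stand-in for your own corpus):

```python
# Rough pre-flight check: will this corpus fit in the context window?
# Requires `pip install tiktoken`. cl100k_base only approximates the
# real tokenizers, so leave yourself headroom.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits(text: str, window: int) -> bool:
    return len(enc.encode(text)) <= window

corpus = open("whole_codebase.txt").read()  # stand-in for your payload
print("GPT-5 (400k):   ", fits(corpus, 400_000))
print("Opus 4.1 (200k):", fits(corpus, 200_000))
```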
Which is better for coding?
Statistically tied on SWE-Bench Verified (74.9% vs 74.5%). Claude has a slight qualitative edge on multi-file refactors; GPT-5 is more consistent on first-shot patches. For most teams, the cheaper Claude Sonnet 4 (72.7%, $3 / $15) is the practical choice over Opus.
Should I use the OpenAI / Anthropic API directly, or a router?
If you're committed to one vendor and have a contract, direct is fine. If you want to A/B-test or hedge, OpenRouter exposes both behind one API at the same per-token price (it earns its margin from volume, not markup).
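For reference, OpenRouter speaks the OpenAI-compatible chat completions API, so an A/B test is a one-line model change. A minimal sketch using the official openai Python SDK; the model slugs are illustrative, so confirm the exact IDs in OpenRouter's model catalog:

```python
# A/B the two flagships through one endpoint. Requires `pip install openai`
# and an OpenRouter API key. Model slugs below are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

for model in ("openai/gpt-5", "anthropic/claude-opus-4.1"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Refactor this function..."}],
    )
    print(model, "->", resp.choices[0].message.content[:80])
```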
Related: Best LLM for coding (2026) · GPT-5 vs Gemini 2.5 Pro · Claude Opus 4.1 vs Gemini 2.5 Pro · DeepSeek R1 vs GPT-5
Methodology and sources: see About. Spotted a number that's out of date? Open an issue.