ChatGPT vs Claude vs Gemini in 2026
GPT-5 vs Claude Opus 4.1 vs Gemini 2.5 Pro — benchmarks, pricing, context, and a clear verdict by use case. The frontier three are closer than ever; the right choice depends on what you're shipping.
OpenRouter routes GPT-5, Claude, Gemini, DeepSeek, Llama, Qwen and 100+ other LLMs behind a single key — pay-as-you-go, no monthly minimum, transparent per-token pricing. Try OpenRouter → (affiliate · supports this site)
The headline numbers
| Metric | GPT-5 | Claude Opus 4.1 | Gemini 2.5 Pro |
|---|---|---|---|
| Arena Elo | 1410 | 1390 | 1380 |
| MMLU-Pro | 86.8 | 87.0 | 86.0 |
| GPQA Diamond | 87.3 | 79.6 | 84.0 |
| MATH | 96.7 | 95.0 | 92.0 |
| HumanEval | 95.1 | 95.4 | 92.0 |
| SWE-Bench Verified | 74.9 | 74.5 | 63.8 |
| Context window | 400k | 200k | 2M |
| Input price ($/1M) | $1.25 | $15.00 | $1.25 |
| Output price ($/1M) | $10.00 | $75.00 | $10.00 |
OpenRouter exposes GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro behind a single API — same price as direct, no per-provider invoices. Try OpenRouter → (affiliate)
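Concretely, "single API" means the standard openai Python client pointed at OpenRouter's base URL. A minimal sketch; the model slugs follow OpenRouter's vendor/model naming convention and should be checked against their live model list:

```python
from openai import OpenAI

# One key, one endpoint, all three frontier models.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

# Slugs follow OpenRouter's vendor/model convention -- verify the
# exact strings against openrouter.ai/models before relying on them.
for model in ("openai/gpt-5", "anthropic/claude-opus-4.1", "google/gemini-2.5-pro"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One-sentence summary of SWE-Bench?"}],
    )
    print(f"{model}: {reply.choices[0].message.content}")
```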
Verdict by use case
Coding agents — GPT-5 (narrow win)
SWE-Bench Verified measures real-world GitHub issue resolution. GPT-5 leads at 74.9%, with Claude Opus 4.1 a hair behind at 74.5% — effectively tied. Gemini 2.5 Pro at 63.8% is a clear step down for autonomous coding work, though it remains excellent for code completion and review.
Pick Claude if you care about clean refactors and conservative changes. Pick GPT-5 if you want the agent to ship the PR.
Writing & long-form prose — Claude
This category resists clean benchmarks, but Claude Opus 4.1 has the strongest reputation among professional writers, technical-doc authors, and long-form journalists. Voice consistency over 50k+ tokens is its differentiator. GPT-5 is sharper at structured writing (JSON, outlines, schemas). Gemini's writing is competent but lacks personality.
Reasoning & math — GPT-5
GPT-5 leads GPQA Diamond (87.3 vs 79.6 vs 84.0) and MATH (96.7 vs 95.0 vs 92.0). Among the three, GPT-5 (and its sibling o3) is the strongest pure reasoner. Gemini holds the middle on natural-science reasoning; Claude Opus 4.1's GPQA score is the weakest of the three.
Long-context (50k+ tokens) — Gemini 2.5 Pro
Only Gemini handles 2 million tokens in a single request. For codebase analysis, multi-document RAG, or video understanding, this is decisive. Even at 200k tokens, Gemini's "needle in a haystack" recall is the strongest of the three.
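The needle-in-a-haystack test is easy to rerun on your own data: plant a known fact at a random depth in long filler text and ask for it back. A rough harness sketch, with placeholder filler and needle, via OpenRouter's OpenAI-compatible API (slug illustrative):

```python
import random
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

NEEDLE = "The vault code is 7-3-9."
# On the order of 200k tokens of placeholder filler; use your own corpus in practice.
sentences = ("Lorem ipsum dolor sit amet. " * 30000).split(". ")
sentences.insert(random.randrange(len(sentences)), NEEDLE)
haystack = ". ".join(sentences)

resp = client.chat.completions.create(
    model="google/gemini-2.5-pro",  # slug illustrative
    messages=[{"role": "user", "content": haystack + "\n\nWhat is the vault code?"}],
)
print(resp.choices[0].message.content)  # should recover "7-3-9"
```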
Cost-sensitive production — GPT-5 or Gemini 2.5 Pro (tied)
GPT-5 and Gemini 2.5 Pro are priced identically at $1.25 input / $10 output per 1M tokens. Claude Opus 4.1 at $15/$75 costs 12× more on input and 7.5× more on output. For high-volume work, the choice between GPT-5 and Gemini comes down to model fit, not cost. Claude Opus is the wrong choice for any cost-sensitive production workload; drop down to Claude Sonnet 4 ($3/$15) instead.
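To make the gap concrete, here's the arithmetic at list prices for an illustrative workload of 50M input / 5M output tokens per month (the volumes are made up; the prices are from the table above):

```python
# Monthly bill at list prices ($ per 1M tokens) for a made-up
# workload of 50M input / 5M output tokens per month.
PRICES = {                      # (input, output)
    "GPT-5":           (1.25, 10.00),
    "Gemini 2.5 Pro":  (1.25, 10.00),
    "Claude Opus 4.1": (15.00, 75.00),
    "Claude Sonnet 4": (3.00, 15.00),
}
IN_M, OUT_M = 50, 5             # millions of tokens

for model, (p_in, p_out) in PRICES.items():
    print(f"{model:16s} ${IN_M * p_in + OUT_M * p_out:,.2f}/mo")
# GPT-5            $112.50/mo
# Gemini 2.5 Pro   $112.50/mo
# Claude Opus 4.1  $1,125.00/mo
# Claude Sonnet 4  $225.00/mo
```

At this input-heavy mix, Opus lands at exactly 10× the GPT-5/Gemini bill; a more output-heavy mix trends toward the 7.5× output multiple.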
Tools & structured output — GPT-5
GPT-5's tool calling, JSON mode, and function-call latency are the most polished. Claude's tool use is excellent but slightly slower; Gemini's is improving but still lags on complex multi-tool sequences.
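For reference, this is the shape of call the paragraph is comparing: OpenAI-style function calling, which Claude and Gemini mirror with their own, differently shaped APIs. The tool itself is hypothetical:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

# A hypothetical tool in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_issue",
        "description": "Fetch a GitHub issue by number.",
        "parameters": {
            "type": "object",
            "properties": {"number": {"type": "integer"}},
            "required": ["number"],
        },
    },
}]

resp = client.chat.completions.create(
    model="openai/gpt-5",  # swap the slug to compare vendors on the same call
    messages=[{"role": "user", "content": "What does issue 42 say?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```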
Cheaper sub-models within each family
If you're cost-sensitive but want the same family, drop a tier:
| Family | Sub-model | MMLU-Pro | $ in / out |
|---|---|---|---|
| OpenAI | GPT-5 mini | 80.1 | $0.25 / $2.00 |
| Anthropic | Claude Sonnet 4 | 84.0 | $3.00 / $15.00 |
| Google | Gemini 2.5 Flash | 79.0 | $0.30 / $2.50 |
For most production work, Claude Sonnet 4 / GPT-5 mini / Gemini 2.5 Flash deliver close to flagship quality at roughly 4–5× lower price. Default to mid-tier unless you've measured a quality gap that hurts you.
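"Measured a quality gap" can be as cheap as replaying a sample of real production prompts through both tiers and comparing answers. A bare-bones sketch; the prompt list, slugs, and comparison step are placeholders you'd replace:

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

PROMPTS = ["..."]  # placeholder: sample real prompts from your logs

def ask(model: str, prompt: str) -> str:
    r = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return r.choices[0].message.content

for prompt in PROMPTS:
    flagship = ask("anthropic/claude-opus-4.1", prompt)  # slugs illustrative
    mid_tier = ask("anthropic/claude-sonnet-4", prompt)
    # Crude comparison: print both and eyeball, or swap in a rubric/LLM judge.
    print(f"--- {prompt[:40]}\nOpus:   {flagship}\nSonnet: {mid_tier}")
```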
What about the open-source frontier?
If you've settled on a frontier closed model but want to know what you're giving up by skipping open weights: DeepSeek R1 (composite 78.0) is the closest open-weights model to this trio, sitting roughly between GPT-5 mini and Gemini 2.5 Flash. See best open-source LLM for the full picture.
The verdict, simplified
- "I'm building one app, pick the default" → GPT-5. Best generalist with the strongest tool ecosystem and tied-cheapest at this tier.
- "My users care about prose quality" → Claude Opus 4.1. Premium price, premium writing.
- "I need 2M tokens of context, video, or audio in" → Gemini 2.5 Pro. Only viable choice for very long inputs and native multimodal.
- "I haven't decided" → ship on OpenRouter so you can switch with a string change.
Frequently asked questions
Which is better in 2026: ChatGPT, Claude, or Gemini?
All three sit at the top of the llmrank.top composite within ~4 points of each other. The gap is small enough that fit for use case matters more than score. Coding → GPT-5. Writing → Claude Opus 4.1. Long context / multimodal → Gemini 2.5 Pro.
Which is cheapest?
GPT-5 and Gemini 2.5 Pro are tied at $1.25 input / $10 output per 1M tokens. Claude Opus 4.1 at $15/$75 costs 12× more on input and 7.5× more on output. For mid-tier, GPT-5 mini ($0.25/$2.00) and Gemini 2.5 Flash ($0.30/$2.50) are the value picks; Claude Sonnet 4 ($3/$15) is roughly 10× pricier but stronger on coding.
Which has the longest context window?
Gemini 2.5 Pro at 2 million tokens — 5× GPT-5 and 10× Claude Opus 4.1. For very long inputs (codebases, multi-PDF, hour-long video), Gemini is the only frontier choice.
Which is best for ChatGPT-style apps?
If you want the literal ChatGPT experience, GPT-5 is the model behind it. If you want similar quality at lower cost, GPT-5 mini ($0.25/$2.00) is 5× cheaper and only ~6 composite points lower.
Should I use ChatGPT, Claude, or Gemini for code?
For autonomous coding agents (SWE-Bench), GPT-5 leads narrowly over Claude Opus 4.1, and both clearly beat Gemini. For inline code completion, all three are excellent — quality differences disappear at the token-by-token level.
Methodology and sources: see About. Spotted an error? Open an issue.
Get the weekly LLM digest
Frontier-model price drops, leaderboard movements, and the one chart that mattered this week. No spam.
Or follow updates on GitHub.