State of the LLM market — May 2026 snapshot
Who's on top, what fell in price, and how big the gap between open and closed actually is right now. Data pulled from the live leaderboard on the morning of 10 May 2026.
TL;DR
- GPT-5 still leads on composite, but Claude Opus 4.1 and Gemini 2.5 Pro are within 3 points and trade wins on individual benchmarks.
- The cheap tier did all the work this cycle. Below $1 / 1M output, you can now buy 80%+ of frontier quality from at least four providers.
- DeepSeek R1 is the only open-weights model that genuinely competes with GPT-5 on reasoning — and at $0.55 input / $2.19 output, the price gap is the real story, not the score.
- If you're still defaulting to Claude Opus or o1 for everyday work, you're paying 5–60× the right number.
OpenRouter routes GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, DeepSeek R1, Llama 3.3 and 100+ others behind a single endpoint — pay-as-you-go, no minimum. Try OpenRouter → (affiliate · supports this site)
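If you want to kick the tyres, OpenRouter speaks the OpenAI wire format, so the stock `openai` client works with a base-URL override. A minimal sketch; the model slug and env-var name are illustrative, so check the router's model list before copying:

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint; point the stock client at it.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # your OpenRouter key
)

# Model slug is an assumption: confirm the current name on the router's model list.
resp = client.chat.completions.create(
    model="deepseek/deepseek-r1",
    messages=[{"role": "user", "content": "Summarise the CAP theorem in two sentences."}],
)
print(resp.choices[0].message.content)
```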
Top of the leaderboard
Frontier-tier composite scores are clustering inside a six-point band; competitive parity is real now. The differentiation has moved to cost-per-quality, not raw quality:
| # | Model | Composite | $ in / out per 1M | $ / quality* |
|---|---|---|---|---|
| 1 | GPT-5 | ~91 | $1.25 / $10 | low |
| 2 | Claude Opus 4.1 | ~89 | $15 / $75 | very high |
| 3 | Gemini 2.5 Pro | ~88 | $1.25 / $10 | low |
| 4 | Grok 4 | ~86 | $3 / $15 | medium |
| 5 | DeepSeek R1 | ~85 | $0.55 / $2.19 | very low |
* Approximated via output price ÷ composite score. Lower is better. Composite values are point-in-time and may shift as benchmarks update — see the live leaderboard for current numbers.
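If you want to reproduce the column, it's one division per row. A quick sketch using the point-in-time numbers from the table above:

```python
# Output price (USD per 1M tokens) divided by composite score -- lower is better.
models = {
    "GPT-5": (10.00, 91),
    "Claude Opus 4.1": (75.00, 89),
    "Gemini 2.5 Pro": (10.00, 88),
    "Grok 4": (15.00, 86),
    "DeepSeek R1": (2.19, 85),
}
for name, (out_price, score) in sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:<16} ${out_price / score:.3f} per composite point")
```

Run it and DeepSeek R1 lands around $0.026 per composite point versus roughly $0.84 for Claude Opus 4.1, which is the whole pricing story in one number.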
What stands out: Claude Opus 4.1 charges roughly 34× DeepSeek R1's output price ($75 vs $2.19 per 1M tokens) for a ~4-point composite-score lead. That gap is only worth paying for in workflows where the marginal point of quality has measurable downstream value (legal review, agentic coding on critical infra, regulated content). For everything else (coding tools, RAG, batch summarisation) the rest of the column is the rational choice.
The cheap-tier story
The most interesting movement isn't at the top — it's at the bottom of the price column. Sub-$1 / 1M output is now a fully populated tier, and the quality is more than adequate for production:
| Model | $ in / out per 1M | MMLU-Pro | HumanEval | Best for |
|---|---|---|---|---|
| Phi-4 | $0.07 / $0.14 | 70.4 | 82.6 | Edge / single-GPU |
| Gemini 2.0 Flash | $0.10 / $0.40 | 76.4 | ~89 | Multimodal at scale |
| GPT-4o mini | $0.15 / $0.60 | 73.0 | 87.2 | OpenAI-native stacks |
| Qwen2.5-Coder 32B | $0.18 / $0.18 | 68.4 | 88.4 | Self-host coding |
| Llama 3.3 70B | $0.23 / $0.40 | 68.9 | 88.4 | General-purpose chat |
The key observation: Phi-4 at $0.07 input gives you 70 MMLU-Pro, roughly 80% of GPT-5's MMLU-Pro at under 6% of GPT-5's input price. Not every workload needs GPT-5. For most production reads, the right answer is "Phi-4 or Llama 3.3 by default, escalate to a frontier model only when the cheap path fails a quality gate".
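A minimal sketch of that escalate-on-failure pattern, reusing the OpenRouter setup from earlier; the model slugs and the gate itself are placeholders you'd swap for your own check (schema validation, an eval score, a judge model):

```python
import os
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.environ["OPENROUTER_API_KEY"])

CHEAP = "meta-llama/llama-3.3-70b-instruct"  # placeholder slug -- check the router's list
FRONTIER = "openai/gpt-5"                    # placeholder slug

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def passes_gate(text: str) -> bool:
    # Placeholder gate: real ones validate JSON schemas, run evals, or score with a judge.
    return bool(text) and len(text) > 40

def answer(prompt: str) -> str:
    draft = ask(CHEAP, prompt)
    if passes_gate(draft):
        return draft              # cheap path succeeded; most requests end here
    return ask(FRONTIER, prompt)  # escalate only the failures to the frontier model
```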
If you haven't run the math on your own workload, the API cost calculator will tell you in 30 seconds what switching from GPT-5 to a cheap-tier default would save per year.
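The arithmetic behind the calculator is simple enough to sanity-check by hand: daily tokens times price per 1M tokens, times 365. An illustrative workload (the volumes are made up; prices come from the tables above):

```python
DAILY_IN, DAILY_OUT = 20_000_000, 4_000_000  # illustrative: 20M input / 4M output tokens per day

def annual_spend(price_in: float, price_out: float) -> float:
    # price_* are USD per 1M tokens, as quoted in the tables above
    return (DAILY_IN * price_in + DAILY_OUT * price_out) / 1_000_000 * 365

gpt5 = annual_spend(1.25, 10.00)
llama = annual_spend(0.23, 0.40)
print(f"GPT-5: ${gpt5:,.0f}/yr  Llama 3.3: ${llama:,.0f}/yr  saved: ${gpt5 - llama:,.0f}/yr")
```

At that volume the cheap-tier default saves roughly $21k a year. Your numbers will differ, which is exactly what the calculator is for.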
Open vs closed: the gap is shrinking, but not where it matters most
Open-weights models close the gap on raw benchmarks but still trail on the messier ones — agentic coding (SWE-Bench), multi-turn reasoning under noise, and tool-use reliability:
| Metric | GPT-5 (closed) | DeepSeek R1 (open) | Gap |
|---|---|---|---|
| MMLU-Pro | 86.8 | 84.0 | −2.8 |
| MATH | 96.7 | 97.3 | +0.6 (R1 wins) |
| GPQA | 87.3 | 71.5 | −15.8 |
| HumanEval | 95.1 | ~94 | ≈ even |
| SWE-Bench | 74.9 | 49.2 | −25.7 |
The takeaway: open-weights have hit "good enough" on knowledge-recall and pure-math benchmarks, but closed frontier models still own the hard agentic-reasoning floor. If your workload is closer to "solve this MATH problem", the open path wins. If it's closer to "navigate this codebase and ship a PR", you still want a closed frontier model, at least for now.
Worth noting: the SWE-Bench gap has narrowed by ~10 points over the past year, with no sign of the trajectory bending. By next year's snapshot, expect open-weights to be within striking distance on agentic benchmarks too.
What's worth watching
- The middle tier is being squeezed. If you're still on Claude 3.5 Sonnet or GPT-4o for general workloads, you're paying yesterday's prices for last-generation quality. Both Claude Sonnet 4 and the new Gemini 2.5 Flash deliver more for less.
- Cheap-tier consolidation. Three providers now offer sub-$0.50 / 1M output models with HumanEval ≥ 80. Watch for someone (probably Google or DeepSeek) to push that floor below $0.10.
- Reasoning-first defaults. The line between "chat" and "reasoning" models is blurring — GPT-5 ships with reasoning baked in, Claude Sonnet 4 does extended thinking on demand. Expect every flagship to be reasoning-capable by year-end.
- Agentic-bench scores will replace Arena Elo as the marketing metric. Chatbot Arena was the right yardstick for chat quality; it's increasingly disconnected from what production users actually need.
Site updates this month
What we shipped on llmrank.top in May:
- Pricing hub — every API price across 30 models, sorted cheapest first, with output-to-input ratio so you see who's hiding the cost in output tokens.
- API cost calculator — type your daily token volume, get a real annual spend estimate per model. Defaults sort by cheapest first.
- Three new comparison guides: best LLM for writing, ChatGPT alternatives, and Claude alternatives.
- Personalised CTA across all 76 pages — the right "Try" prompt now matches the page's topic instead of the previous one-size-fits-all banner.
- Sponsor page — published the rate card and placement options for the first time.
Numbers in this article are pulled from the live leaderboard on 10 May 2026. Spotted a number that's out of date? Open an issue — corrections usually ship within 24h. Affiliate disclosure: "Try OpenRouter" links earn us a small commission. The leaderboard rankings, prices, and verdicts are unaffected.
Get next month's snapshot in your inbox
One email a month: what changed, what's worth switching to, and the price moves you'd otherwise miss.
No spam. Unsubscribe in one click.