
State of the LLM market — May 2026 snapshot

Who's on top, what fell in price, and how big the gap between open and closed actually is right now. Data pulled from the live leaderboard on the morning of 10 May 2026.

TL;DR

- Frontier composite scores now cluster inside a 5-point band; the real differentiation is cost per quality point, not raw quality.
- Sub-$1-per-1M-token output is a fully populated tier (Phi-4, Gemini 2.0 Flash, GPT-4o mini, Qwen2.5-Coder 32B, Llama 3.3 70B), and it's good enough for most production work.
- Open weights have caught up on knowledge and math benchmarks but still trail by ~26 points on agentic coding (SWE-Bench).

Test the whole field on one API key.

OpenRouter routes GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, DeepSeek R1, Llama 3.3 and 100+ others behind a single endpoint — pay-as-you-go, no minimum. Try OpenRouter → (affiliate · supports this site)

Top of the leaderboard

Frontier-tier composite scores are clustering inside a 5-point band — competitive parity is real now. The differentiation has moved to cost-per-quality, not raw quality:

| # | Model | Composite | $ in / out (per 1M tokens) | $ / quality* |
|---|-------|-----------|----------------------------|--------------|
| 1 | GPT-5 | ~91 | $1.25 / $10 | low |
| 2 | Claude Opus 4.1 | ~89 | $15 / $75 | very high |
| 3 | Gemini 2.5 Pro | ~88 | $1.25 / $10 | low |
| 4 | Grok 4 | ~86 | $3 / $15 | medium |
| 5 | DeepSeek R1 | ~85 | $0.55 / $2.19 | very low |

* Approximated via output price ÷ composite score. Lower is better. Composite values are point-in-time and may shift as benchmarks update — see the live leaderboard for current numbers.
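For the curious, here's what that footnote's arithmetic looks like in practice. A minimal Python sketch, using only the point-in-time numbers from the table above:

```python
# Cost-per-quality = output price (USD per 1M tokens) / composite score.
# Lower is better. Values are the point-in-time numbers from the table above.
models = {
    "GPT-5": (10.00, 91),
    "Claude Opus 4.1": (75.00, 89),
    "Gemini 2.5 Pro": (10.00, 88),
    "Grok 4": (15.00, 86),
    "DeepSeek R1": (2.19, 85),
}

for name, (out_price, composite) in sorted(
    models.items(), key=lambda kv: kv[1][0] / kv[1][1]
):
    print(f"{name:<16} ${out_price / composite:.3f} per composite point")
```

DeepSeek R1 comes out at about $0.026 per composite point versus roughly $0.84 for Claude Opus 4.1, which is where the "very low" and "very high" labels come from.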

What stands out: Claude Opus 4.1 charges roughly 34× DeepSeek R1's output price ($75 vs $2.19 per 1M tokens) for a ~4-point composite-score lead. That premium is only worth paying in workflows where the marginal point of quality has measurable downstream value (legal review, agentic coding on critical infra, regulated content). For everything else, such as coding tools, RAG, and batch summarisation, the rest of the table is the rational choice.

The cheap-tier story

The most interesting movement isn't at the top of the leaderboard; it's at the bottom of the price column. Sub-$1 per 1M output tokens is now a fully populated tier, and the quality is more than adequate for production:

| Model | $ in / out (per 1M tokens) | MMLU-Pro | HumanEval | Best for |
|-------|----------------------------|----------|-----------|----------|
| Phi-4 | $0.07 / $0.14 | 70.4 | 82.6 | Edge / single-GPU |
| Gemini 2.0 Flash | $0.10 / $0.40 | 76.4 | ~89 | Multimodal at scale |
| GPT-4o mini | $0.15 / $0.60 | 73.0 | 87.2 | OpenAI-native stacks |
| Qwen2.5-Coder 32B | $0.18 / $0.18 | 68.4 | 88.4 | Self-host coding |
| Llama 3.3 70B | $0.23 / $0.40 | 68.9 | 88.4 | General-purpose chat |

The key observation: Phi-4 at $0.07 per 1M input tokens gives you 70.4 MMLU-Pro, roughly 81% of GPT-5's score (86.8) at under 6% of GPT-5's input price ($0.07 vs $1.25). Not every workload needs GPT-5. For most production reads, the right answer is "Phi-4 or Llama 3.3 by default, escalate to a frontier model only when the cheap path fails a quality gate", as sketched below.
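Here's a minimal sketch of that default-then-escalate pattern. Everything in it is hypothetical: `call_model()` stands in for whatever provider call you use (OpenRouter, a self-hosted endpoint, etc.), and `passes_quality_gate()` stands in for your own acceptance check.

```python
# Default-then-escalate routing: try the cheap model first, and only pay
# frontier prices when the cheap answer fails a quality gate.

CHEAP_MODEL = "phi-4"        # cheap-tier default from the table above
FRONTIER_MODEL = "gpt-5"     # escalation target

def call_model(model_id: str, prompt: str) -> str:
    """Stub: replace with a real API call to your provider."""
    return f"[{model_id}] response to: {prompt}"

def passes_quality_gate(answer: str) -> bool:
    """Stub gate: in practice, a schema check, required-citation check,
    regex, or a small judge model. Cheap and deterministic is the goal."""
    return answer.strip() != ""

def respond(prompt: str) -> str:
    draft = call_model(CHEAP_MODEL, prompt)
    if passes_quality_gate(draft):
        return draft
    # Cheap path failed the gate: pay for the marginal quality.
    return call_model(FRONTIER_MODEL, prompt)

print(respond("Summarise this support ticket in two sentences."))
```

The gate is the whole trick: if it's cheap and deterministic, the escalation rate stays low and the blended cost stays close to the cheap model's price.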

If you haven't run the math on your own workload, the API cost calculator will tell you in 30 seconds what switching from GPT-5 to a cheap-tier default would save per year.
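If you'd rather do the back-of-envelope version yourself, it's three lines of arithmetic. The monthly token volumes below are illustrative assumptions; only the per-1M prices come from the tables above:

```python
# Back-of-envelope annual cost comparison. Prices are from the article's
# tables; the 500M in / 100M out monthly volumes are made-up examples.

PRICES = {  # USD per 1M tokens: (input, output)
    "gpt-5": (1.25, 10.00),
    "phi-4": (0.07, 0.14),
}

def annual_cost(model: str, in_tokens_m: float, out_tokens_m: float) -> float:
    p_in, p_out = PRICES[model]
    return 12 * (in_tokens_m * p_in + out_tokens_m * p_out)

for model in PRICES:
    print(f"{model}: ${annual_cost(model, 500, 100):,.0f} / year")
# gpt-5: $19,500 / year vs phi-4: $588 / year under these assumed volumes.
```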

Open vs closed: the gap is shrinking, but not where it matters most

Open-weights models close the gap on raw benchmarks but still trail on the messier ones — agentic coding (SWE-Bench), multi-turn reasoning under noise, and tool-use reliability:

| Metric | GPT-5 (closed) | DeepSeek R1 (open) | Gap |
|--------|----------------|--------------------|-----|
| MMLU-Pro | 86.8 | 84.0 | −2.8 |
| MATH | 96.7 | 97.3 | +0.6 (R1 wins) |
| GPQA | 87.3 | 71.5 | −15.8 |
| HumanEval | 95.1 | ~94 | ≈ even |
| SWE-Bench | 74.9 | 49.2 | −25.7 |

The takeaway: open weights have hit "good enough" on knowledge-recall and pure-math benchmarks, but closed frontier models still own the harder agentic-reasoning work. If your workload looks like "solve this MATH problem", the open path wins. If it looks like "navigate this codebase and ship a PR", you still want a closed frontier model, at least for now.

Worth noting: the SWE-Bench gap has narrowed by ~10 points over the past year, and there's no sign of the trajectory bending. By next year's snapshot, expect open weights to be within striking distance on agentic benchmarks too.

What's worth watching

Site updates this month

What we shipped on llmrank.top in May:


Numbers in this article are pulled from the live leaderboard on 10 May 2026. Spotted a number that's out of date? Open an issue — corrections usually ship within 24h. Affiliate disclosure: "Try OpenRouter" links earn us a small commission. The leaderboard rankings, prices, and verdicts are unaffected.
