The independent LLM leaderboard

Composite score, raw benchmark numbers, current API pricing, and direct links to try every major model — refreshed continuously.

30 models tracked · 6 benchmarks · Last updated 2026-05-09 · No paywalls · No vendor bias

→ Try any model from one API

# | Model | Score | Arena | MMLU-Pro | GPQA | MATH | Code | SWE | $ in / out | Ctx | Try

How the composite score is calculated

Each benchmark is normalised onto a 0–100 scale (Arena Elo is rescaled from a 1000–1500 band; percent benchmarks pass through unchanged). The composite is a weighted average across the available benchmarks for each model. Models with fewer than three published benchmarks are listed without a composite score so that single-benchmark coding specialists cannot displace well-rounded frontier models.
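The calculation described above can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual code: the page does not publish its weights, so the equal `WEIGHTS` below are placeholders, and the benchmark keys, the function names, and the clamping of out-of-band Elo values are all assumptions.

```python
from typing import Optional

# Hypothetical equal weights; the real weighting is not stated on the page.
WEIGHTS = {
    "arena": 1.0,
    "mmlu_pro": 1.0,
    "gpqa": 1.0,
    "math": 1.0,
    "code": 1.0,
    "swe": 1.0,
}
MIN_BENCHMARKS = 3  # fewer than three published benchmarks -> no composite


def normalise(name: str, value: float) -> float:
    """Map a raw benchmark value onto a 0-100 scale."""
    if name == "arena":
        # Arena Elo is rescaled from the 1000-1500 band.
        scaled = (value - 1000.0) / (1500.0 - 1000.0) * 100.0
        # Clamping values outside the band is an assumption.
        return max(0.0, min(100.0, scaled))
    # Percent benchmarks pass through unchanged.
    return value


def composite(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Weighted average over the benchmarks a model actually has."""
    available = {
        k: v for k, v in scores.items() if v is not None and k in WEIGHTS
    }
    if len(available) < MIN_BENCHMARKS:
        return None  # listed without a composite score
    total_weight = sum(WEIGHTS[k] for k in available)
    weighted = sum(WEIGHTS[k] * normalise(k, v) for k, v in available.items())
    return weighted / total_weight
```

Because the weights are renormalised over the available benchmarks, a model missing one or two scores is averaged over what it has rather than penalised with zeros. With these placeholder weights, a model with Arena 1250, MMLU-Pro 80, and GPQA 60 averages 50, 80, and 60, while a model with only two coding scores gets `None`.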

Scores are compiled from provider technical reports, public papers, and Chatbot Arena snapshots. Submit corrections via the GitHub issue tracker.

Browse all models A–Z →

Popular guides & head-to-heads

Best LLM for coding (2026) [guide]: SWE-Bench & HumanEval ranked, with API price and context window for every tier.
Best LLM for RAG (2026) [guide]: Long-context recall, citation faithfulness, and price-per-1M ranked head-to-head.
Best LLM for agents (2026) [guide]: Tool-call reliability, SWE-Bench, and per-task cost for production agentic workloads.
Best open-source LLM (2026) [guide]: DeepSeek R1, Llama 3.3, Qwen, and Phi-4 ranked by benchmarks and licence.
Best cheap LLM API (2026) [guide]: Price-per-quality rankings, hidden fees, and cost calculators.
Best free LLM API (2026) [guide]: Real free tiers ranked (Gemini, DeepSeek, Groq, OpenRouter), with rate limits and the catch.
ChatGPT vs Claude vs Gemini [vs]: Three-way frontier head-to-head: benchmarks, pricing, and verdict by use case.
GPT-5 vs Claude Opus 4.1 [vs]: Frontier head-to-head: benchmarks, pricing, and verdict by use case.
Claude vs Gemini [vs]: Anthropic's precision vs Google's scale and 2M context.
GPT-5 vs Gemini 2.5 Pro [vs]: OpenAI's flagship vs Google's 2M-context multimodal model.
DeepSeek R1 vs GPT-5 [vs]: The best open-weights reasoning model against the closed frontier.
Claude Sonnet 4 vs GPT-4.1 [vs]: Mid-tier production workhorses compared.
Llama 3.3 70B vs Qwen 2.5 72B [vs]: The two strongest open-weights generalists at 70B scale.

Popular head-to-head comparisons

Pick any two models to compare their composite score, raw benchmark numbers, and API pricing, with a one-click route to try both behind the same key.

GPT-5 vs Claude Opus 4.1 [frontier]: The two highest-scoring closed-frontier models, compared by composite, code, and price.
GPT-5 vs Gemini 2.5 Pro [frontier]: OpenAI flagship vs Google's 2M-context multimodal frontier.
GPT-5 vs DeepSeek R1 [closed vs open]: The best closed model against the best open-weights reasoning model.
GPT-5 vs Grok 4 [frontier]: OpenAI vs xAI on benchmarks, context window, and price.
Claude Opus 4.1 vs Gemini 2.5 Pro [frontier]: Anthropic's top model against Google's 2M-context flagship.
o3 vs DeepSeek R1 [reasoning]: OpenAI's reasoning specialist vs the open-weights challenger.
Claude 3.7 Sonnet vs GPT-5 [frontier]: Mid-frontier Sonnet against the new top dog.
GPT-5 mini vs Gemini 2.5 Flash [fast / cheap]: Production-tier fast models head-to-head on price-per-quality.
GPT-4o mini vs Gemini 2.0 Flash [fast / cheap]: The two cheapest mainstream APIs at high volume.
DeepSeek V3 vs GPT-4o mini [cheap]: Open-weights value vs OpenAI's smallest production model.
Llama 3.1 405B vs DeepSeek V3 [open]: The two heavyweight open-weights models compared.
Claude 3.5 Sonnet vs GPT-4o [general]: The 2024-era workhorses still in heavy production use today.
