30 matchups
LLM head-to-head comparisons
Every popular "X vs Y" matchup on one page. Each link goes to a side-by-side page with composite score, raw benchmark numbers, API pricing, cost-at-scale, and a verdict by use case.
Frontier head-to-heads
The closed flagships fighting for #1 on the leaderboard.
- GPT-5 vs Claude Opus 4.1 (OpenAI · Anthropic): composite 86.0 vs 83.6 · benchmarks · pricing · verdict
- GPT-5 vs Gemini 2.5 Pro (OpenAI · Google): composite 86.0 vs 80.9 · benchmarks · pricing · verdict
- GPT-5 vs Grok 4 (OpenAI · xAI): composite 86.0 vs 83.6 · benchmarks · pricing · verdict
- Claude Opus 4.1 vs Gemini 2.5 Pro (Anthropic · Google): composite 83.6 vs 80.9 · benchmarks · pricing · verdict
- o3 vs GPT-5 (OpenAI · OpenAI): composite 83.7 vs 86.0 · benchmarks · pricing · verdict
- Gemini 2.5 Pro vs GPT-5 (Google · OpenAI): composite 80.9 vs 86.0 · benchmarks · pricing · verdict
- Grok 4 vs Claude Opus 4.1 (xAI · Anthropic): composite 83.6 vs 83.6 · benchmarks · pricing · verdict
- Grok 4 vs Gemini 2.5 Pro (xAI · Google): composite 83.6 vs 80.9 · benchmarks · pricing · verdict
Closed vs open
Top closed-source models against the best open-weights challengers.
- GPT-5 vs DeepSeek R1 (OpenAI · DeepSeek): composite 86.0 vs 75.4 · benchmarks · pricing · verdict
- Claude Opus 4.1 vs DeepSeek R1 (Anthropic · DeepSeek): composite 83.6 vs 75.4 · benchmarks · pricing · verdict
- o3 vs DeepSeek R1 (OpenAI · DeepSeek): composite 83.7 vs 75.4 · benchmarks · pricing · verdict
- DeepSeek V3 vs GPT-4o mini (DeepSeek · OpenAI): composite 68.0 vs 61.3 · benchmarks · pricing · verdict
- DeepSeek R1 vs GPT-5 (DeepSeek · OpenAI): composite 75.4 vs 86.0 · benchmarks · pricing · verdict
- Gemini 2.5 Pro vs DeepSeek R1 (Google · DeepSeek): composite 80.9 vs 75.4 · benchmarks · pricing · verdict
- Qwen2.5-Coder 32B vs Codestral 25.01 (Alibaba · Mistral AI): composite 68.8 vs n/a · benchmarks · pricing · verdict
Cheap / fast tier
Production-volume mini models compared on price-per-quality.
- GPT-5 mini vs Gemini 2.5 Flash (OpenAI · Google): composite 77.0 vs 73.3 · benchmarks · pricing · verdict
- GPT-5 mini vs Claude 3.5 Haiku (OpenAI · Anthropic): composite 77.0 vs 56.2 · benchmarks · pricing · verdict
- GPT-4o mini vs Gemini 2.0 Flash (OpenAI · Google): composite 61.3 vs 65.6 · benchmarks · pricing · verdict
- Claude 3.5 Haiku vs GPT-4o mini (Anthropic · OpenAI): composite 56.2 vs 61.3 · benchmarks · pricing · verdict
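One way to read "price-per-quality" is composite points per blended dollar. The sketch below is only an illustration under that assumption; the site's actual formula is not stated here. The `Candidate` type, the 25% output-token share, and the model names and prices are all hypothetical placeholders, not real quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    composite: float          # composite score, as shown on a comparison card
    input_usd_per_m: float    # USD per 1M input tokens (placeholder value)
    output_usd_per_m: float   # USD per 1M output tokens (placeholder value)


def blended_price(c: Candidate, output_share: float = 0.25) -> float:
    """USD per 1M tokens when `output_share` of all tokens are output tokens."""
    return (1 - output_share) * c.input_usd_per_m + output_share * c.output_usd_per_m


def points_per_dollar(c: Candidate, output_share: float = 0.25) -> float:
    """Composite points per blended USD; higher means better value."""
    return c.composite / blended_price(c, output_share)


# Hypothetical models and prices, for illustration only.
candidates = [
    Candidate("model-a", composite=80.0, input_usd_per_m=1.00, output_usd_per_m=4.00),
    Candidate("model-b", composite=60.0, input_usd_per_m=0.20, output_usd_per_m=0.80),
]

# Rank from best value to worst.
ranked = sorted(candidates, key=points_per_dollar, reverse=True)
for c in ranked:
    print(f"{c.name}: {points_per_dollar(c):.1f} points per blended USD")
```

With these placeholder numbers the cheaper, lower-scoring model ranks first, which is the trade-off this tier is about: a lower composite can still win on value at production volume.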
Open-weights duels
Self-hostable models head-to-head — DeepSeek, Llama, Qwen, Mistral, Phi.
- Llama 3.3 70B Instruct vs Qwen2.5 72B Instruct (Meta · Alibaba): composite 64.7 vs 65.6 · benchmarks · pricing · verdict
- Llama 3.1 405B Instruct vs DeepSeek V3 (Meta · DeepSeek): composite 65.7 vs 68.0 · benchmarks · pricing · verdict
- Qwen2.5 72B Instruct vs DeepSeek V3 (Alibaba · DeepSeek): composite 65.6 vs 68.0 · benchmarks · pricing · verdict
- Llama 3.3 70B Instruct vs DeepSeek V3 (Meta · DeepSeek): composite 64.7 vs 68.0 · benchmarks · pricing · verdict
- Phi-4 vs Llama 3.3 70B Instruct (Microsoft · Meta): composite 71.2 vs 64.7 · benchmarks · pricing · verdict
Cross-tier & cross-vendor
Other matchups — different tiers or vendors.
- GPT-5 vs Claude Sonnet 4 (OpenAI · Anthropic): composite 86.0 vs 80.7 · benchmarks · pricing · verdict
- Claude 3.7 Sonnet vs GPT-5 (Anthropic · OpenAI): composite 76.0 vs 86.0 · benchmarks · pricing · verdict
- Claude Sonnet 4 vs GPT-4.1 (Anthropic · OpenAI): composite 80.7 vs 74.5 · benchmarks · pricing · verdict
- Claude 3.5 Sonnet vs GPT-4o (Anthropic · OpenAI): composite 69.1 vs 66.8 · benchmarks · pricing · verdict
- Mistral Large 2 vs Claude Sonnet 4 (Mistral AI · Anthropic): composite 63.7 vs 80.7 · benchmarks · pricing · verdict
- Mistral Large 2 vs GPT-4.1 (Mistral AI · OpenAI): composite 63.7 vs 74.5 · benchmarks · pricing · verdict
Looking for a specific pair we don't list? Use the custom 2-model comparison tool — every model on the leaderboard can be picked from the dropdown.