The best open-source LLM in 2026
Open-weights models ranked by benchmarks, licence, and self-hosting feasibility. No vendor lock-in — just the numbers.
OpenRouter routes GPT-5, Claude, Gemini, DeepSeek, Llama, Qwen and 100+ other LLMs behind a single key — pay-as-you-go, no monthly minimum, transparent per-token pricing. Try OpenRouter → (affiliate · supports this site)
TL;DR — pick by use case
| Use case | Best pick | MMLU-Pro | Licence |
|---|---|---|---|
| Overall capability (frontier-tier) | DeepSeek R1 | 84.0 | MIT |
| General-purpose workhorse | Llama 3.3 70B | 68.9 | Llama 3.3 |
| Coding specialist | Qwen2.5-Coder 32B | 68.4 | Apache-2.0 |
| Self-host on single GPU | Phi-4 · Qwen2.5-Coder 32B | 70.4 / 68.4 | MIT / Apache-2.0 |
| Largest open model | Llama 3.1 405B | 73.3 | Llama 3.1 |
| Cheap API access | DeepSeek V3 | 75.9 | DeepSeek |
OpenRouter hosts Llama, DeepSeek, Qwen, and 50+ open-weights models with per-token billing — no minimums. Try OpenRouter → (affiliate)
Why open-source LLMs matter in 2026
Closed frontier models (GPT-5, Claude, Gemini) are excellent, but they come with trade-offs: data privacy concerns, API rate limits, unpredictable price hikes, and terms-of-service restrictions. Open-weights models let you run inference on your own hardware, fine-tune on proprietary data, and avoid vendor lock-in entirely.
In 2026, the gap between the best open models and the closed frontier has narrowed to the point where many production workloads are better served by open weights, especially when data sovereignty or cost predictability matters.
Tier 1 — Frontier-class open models
These models are within striking distance of GPT-5 and Claude on most benchmarks.
DeepSeek R1 — the open reasoning champion
- Scores: 84.0% MMLU-Pro, 97.3% MATH, 71.5% GPQA, 49.2% SWE-Bench
- Size: 671B MoE (37B active per token)
- Licence: MIT — completely unrestricted
- API price: $0.55 input / $2.19 output per 1M tokens
- Self-hosting: Needs 8× H100 or equivalent for real-time inference
R1 is the only open-weights model that beats GPT-5 on MATH (97.3% vs 96.7%) and comes within 3 points on MMLU-Pro. Its reasoning ability is genuinely frontier-class. The downside is infrastructure: this is not a model you run on a single GPU.
DeepSeek V3 — the practical alternative
- Scores: 75.9% MMLU-Pro, 90.2% MATH, 42.0% SWE-Bench
- Size: 671B MoE
- Licence: DeepSeek License (permissive for research and commercial use)
- API price: $0.27 input / $1.10 output per 1M tokens
V3 trades some reasoning depth for much lower latency and cost. It's the best value open-weights model on the market — roughly GPT-4o quality at 1/10 the price.
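If you just want to hit V3 (or any model on this page) over an API, any OpenAI-compatible endpoint works. Here's a minimal sketch using the openai Python SDK pointed at OpenRouter; the model ID and environment-variable name below are assumptions, so check your provider's model list before copying it:

```python
# Minimal sketch: calling DeepSeek V3 through an OpenAI-compatible endpoint.
# The base URL, model ID and env-var name are assumptions -- check your
# provider's docs for the exact identifiers.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # any OpenAI-compatible host works
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed env-var name
)

response = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # assumed routing ID for DeepSeek V3
    messages=[{"role": "user", "content": "Summarise the MIT licence in one sentence."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```

Because the interface is OpenAI-compatible, swapping between V3, R1, or a Llama endpoint is a one-line model-ID change.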
Llama 3.1 405B — the biggest open model
- Scores: 73.3% MMLU-Pro, 73.8% MATH, 89.0% HumanEval
- Size: 405B dense
- Licence: Llama 3.1 Community License
- API price: $2.70 flat per 1M tokens
Meta's flagship is the largest openly released dense model. Strong general capability, but expensive to self-host (needs multi-node inference) and pricier on API than DeepSeek.
Tier 2 — Workhorse models (70B scale)
These are the models most teams actually deploy. They run on 1–4 consumer GPUs and deliver 80%+ of frontier quality.
Llama 3.3 70B — the safe default
- Scores: 68.9% MMLU-Pro, 77.0% MATH, 88.4% HumanEval
- Self-hosting: Fits on 2× RTX 4090 (48GB total) or 1× A100 80GB with 4-bit quantisation (fp16 needs ~140GB)
- Ecosystem: Largest fine-tune ecosystem (LoRAs, quantised GGUFs, MLX)
If you want an open model with the most tooling, community support, and deployment guides, Llama 3.3 70B is the default choice. It quantises cleanly to Q4_K_M for CPU inference and has hundreds of fine-tunes on Hugging Face.
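As a rough sketch of what that CPU path looks like with llama-cpp-python and a Q4_K_M GGUF (the filename below is a placeholder for whichever quantised file you download from Hugging Face):

```python
# Minimal sketch: CPU inference on a Q4_K_M GGUF with llama-cpp-python.
# The .gguf filename is a placeholder, not an exact release name.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # placeholder filename
    n_ctx=4096,       # context window
    n_threads=16,     # tune to your CPU core count
    n_gpu_layers=0,   # 0 = pure CPU; raise this to offload layers to a GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about open weights."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

Expect single-digit tokens per second on CPU at 70B; the same script speeds up dramatically if you set n_gpu_layers above zero on a machine with any VRAM at all.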
Qwen2.5 72B — the bilingual specialist
- Scores: 71.1% MMLU-Pro, 83.1% MATH, 86.6% HumanEval
- Strength: Superior Chinese-English bilingual capability
- API price: $0.35 input / $0.40 output per 1M tokens
Qwen2.5 72B outscores Llama 3.3 70B on most STEM benchmarks and is the best choice for teams serving Chinese-speaking users. The Qwen ecosystem also has excellent vision and coding variants.
Tier 3 — Small models for edge and single-GPU
Qwen2.5-Coder 32B — best small coder
- Scores: 92.7% HumanEval, 68.4% MMLU-Pro
- Licence: Apache-2.0
- Self-hosting: Fits on 1× A100 40GB or 2× RTX 3090 with 8-bit or 4-bit quantisation (fp16 needs ~64GB)
The best coding model you can self-host on a single GPU. Apache-2.0 licence means zero legal friction for commercial products.
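For single-GPU serving, a hedged sketch with vLLM is below; the AWQ-quantised Hub ID is an assumption (any 4-bit variant that fits your card will do, since the full-precision weights need ~64GB):

```python
# Minimal sketch: offline batch inference with vLLM on a single GPU.
# "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ" is an assumed Hub ID for a
# 4-bit AWQ variant; substitute whichever quantised checkpoint you use.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ", quantization="awq")
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(
    ["Write a Python function that parses an ISO-8601 date string."],
    params,
)
print(outputs[0].outputs[0].text)
```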
Phi-4 — punches above its weight
- Scores: 70.4% MMLU-Pro, 80.4% MATH, 82.6% HumanEval
- Size: 14B — tiny for its capability
- Licence: MIT
- Self-hosting: Runs quantised on 1× RTX 4090 with room to spare (fp16 weights are ~28GB)
Microsoft's 14B model is the efficiency king. It scores higher on MMLU-Pro than Llama 3.3 70B despite being 5× smaller. If you have limited VRAM or need low latency, Phi-4 is extraordinary.
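As an illustration, here is a minimal 4-bit load with transformers + bitsandbytes on a 24GB card; the "microsoft/phi-4" Hub ID is assumed, so check the model card for the exact name:

```python
# Minimal sketch: loading Phi-4 in 4-bit on a single 24GB GPU.
# "microsoft/phi-4" is the assumed Hugging Face Hub ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    quantization_config=quant_cfg,
    device_map="auto",  # place layers on the available GPU
)

inputs = tokenizer("Explain mixture-of-experts in two sentences.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

At 4-bit the weights occupy roughly 7GB, which is what leaves the headroom for long contexts on a 24GB card.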
Licence comparison
| Licence | Commercial use | Modifications | Distribution | Models |
|---|---|---|---|---|
| MIT | ✓ | ✓ | ✓ | DeepSeek R1, Phi-4 |
| Apache-2.0 | ✓ | ✓ | ✓ | Qwen2.5-Coder 32B |
| Llama Community | ✓ (with limits) | ✓ | ✓ (with limits) | Llama 3.1/3.3 |
| DeepSeek License | ✓ | ✓ | ✓ | DeepSeek V3 |
Frequently asked questions
What is the best open-source LLM in 2026?
DeepSeek R1 (MIT licence) is the strongest open-weights model overall with 84.0% MMLU-Pro and 97.3% MATH — rivalling closed frontier models. For practical self-hosting, Llama 3.3 70B has the best ecosystem, and Phi-4 (14B) delivers the most capability per parameter.
Can I use these models commercially?
Yes, with caveats. MIT and Apache-2.0 models (DeepSeek R1, Phi-4, Qwen2.5-Coder) have zero restrictions. Llama models have a 700M-user commercial cap and require compliance with Meta's acceptable-use policy. Always read the licence before shipping a product.
How much GPU memory do I need?
In fp16: Phi-4 needs ~28GB, Qwen2.5-Coder 32B needs ~64GB, Llama 3.3 70B needs ~140GB, DeepSeek V3 needs ~1.3TB. Use 4-bit quantisation (GGUF, AWQ, GPTQ) to cut these by 50–75%.
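Those fp16 figures are just parameter count × 2 bytes. A quick back-of-envelope helper that reproduces them (weights only; the KV cache and activations add more on top, especially at long context):

```python
# Weights-only VRAM estimate: parameter count x bytes per weight.
# KV cache, activations and framework overhead are NOT included.
def weight_vram_gb(params_billion: float, bits_per_weight: int = 16) -> float:
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

for name, params in [("Phi-4", 14), ("Qwen2.5-Coder 32B", 32),
                     ("Llama 3.3 70B", 70), ("DeepSeek V3", 671)]:
    print(f"{name}: ~{weight_vram_gb(params):.0f} GB fp16, "
          f"~{weight_vram_gb(params, 4):.0f} GB at 4-bit")
```

Running it gives 28 / 64 / 140 / 1342 GB at fp16, matching the figures above, and shows why 4-bit quantisation is what makes the 32B and 70B tiers practical on consumer hardware.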
Is an open model better than GPT-5 for my use case?
If you need the absolute highest quality on complex reasoning, GPT-5 still wins. But if you value data privacy, cost predictability, custom fine-tuning, or avoiding vendor lock-in, open models are often the better business decision — especially DeepSeek R1 and V3.
Methodology and sources: see About. Spotted a number that's out of date? Open an issue.