The best open-source LLM in 2026
Open-weights models ranked by benchmarks, licence, and self-hosting feasibility. No vendor lock-in — just the numbers.
OpenRouter routes GPT-5, Claude, Gemini, DeepSeek, Llama, Qwen and 100+ other LLMs behind a single key — pay-as-you-go, no monthly minimum, transparent per-token pricing. Try OpenRouter → (affiliate · supports this site)
TL;DR — pick by use case
| Use case | Best pick | MMLU-Pro | Licence |
|---|---|---|---|
| Overall capability (frontier-tier) | DeepSeek R1 | 84.0 | MIT |
| General-purpose workhorse | Llama 3.3 70B | 68.9 | Llama 3.3 |
| Coding specialist | Qwen2.5-Coder 32B | 68.4 | Apache-2.0 |
| Self-host on single GPU | Phi-4 · Qwen2.5-Coder 32B | 70.4 / 68.4 | MIT / Apache-2.0 |
| Largest open model | Llama 3.1 405B | 73.3 | Llama 3.1 |
| Cheap API access | DeepSeek V3 | 75.9 | DeepSeek |
OpenRouter hosts Llama, DeepSeek, Qwen, and 50+ open-weights models with per-token billing — no minimums. Try OpenRouter → (affiliate)
Why open-source LLMs matter in 2026
Closed frontier models (GPT-5, Claude, Gemini) are excellent, but they come with trade-offs: data privacy concerns, API rate limits, unpredictable price hikes, and terms-of-service restrictions. Open-weights models let you run inference on your own hardware, fine-tune on proprietary data, and avoid vendor lock-in entirely.
In 2026, the gap between the best open models and the closed frontier has narrowed to the point where many production workloads are better served by open weights, especially when data sovereignty or cost predictability matters.
Tier 1 — Frontier-class open models
These models are within striking distance of GPT-5 and Claude on most benchmarks.
DeepSeek R1 — the open reasoning champion
- Scores: 84.0% MMLU-Pro, 97.3% MATH, 71.5% GPQA, 49.2% SWE-Bench
- Size: 671B MoE (37B active per token)
- Licence: MIT — completely unrestricted
- API price: $0.55 input / $2.19 output per 1M tokens
- Self-hosting: Needs 8× H100 or equivalent for real-time inference
R1 is the only open-weights model that beats GPT-5 on MATH (97.3% vs 96.7%) and comes within 3 points on MMLU-Pro. Its reasoning ability is genuinely frontier-class. The downside is infrastructure: this is not a model you run on a single GPU.
DeepSeek V3 — the practical alternative
- Scores: 75.9% MMLU-Pro, 90.2% MATH, 42.0% SWE-Bench
- Size: 671B MoE
- Licence: DeepSeek License (permissive for research and commercial use)
- API price: $0.27 input / $1.10 output per 1M tokens
V3 trades some reasoning depth for much lower latency and cost. It's the best value open-weights model on the market — roughly GPT-4o quality at 1/10 the price.
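If you just want to hit V3 (or any model on this page) over an API, any OpenAI-compatible endpoint works. Here's a minimal sketch using the openai Python SDK pointed at OpenRouter; the model ID and environment-variable name below are assumptions, so check your provider's model list before copying it:

```python
# Minimal sketch: calling DeepSeek V3 through an OpenAI-compatible endpoint.
# The base URL, model ID and env-var name are assumptions -- check your
# provider's docs for the exact identifiers.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # any OpenAI-compatible host works
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed env-var name
)

response = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # assumed routing ID for DeepSeek V3
    messages=[{"role": "user", "content": "Summarise the MIT licence in one sentence."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```

Because the interface is OpenAI-compatible, swapping between V3, R1, or a Llama endpoint is a one-line model-ID change.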
Llama 3.1 405B — the biggest open model
- Scores: 73.3% MMLU-Pro, 73.8% MATH, 89.0% HumanEval
- Size: 405B dense
- Licence: Llama 3.1 Community License
- API price: $2.70 flat per 1M tokens
Meta's flagship is the largest openly released dense model. Strong general capability, but expensive to self-host (needs multi-node inference) and pricier on API than DeepSeek.
Tier 2 — Workhorse models (70B scale)
These are the models most teams actually deploy. They run on 1–4 consumer GPUs and deliver 80%+ of frontier quality.
Llama 3.3 70B — the safe default
- Scores: 68.9% MMLU-Pro, 77.0% MATH, 88.4% HumanEval
- Self-hosting: Fits on 2× RTX 4090 (48GB total) or 1× A100 80GB with 4-bit quantisation (fp16 needs ~140GB)
- Ecosystem: Largest fine-tune ecosystem (LoRAs, quantised GGUFs, MLX)
If you want an open model with the most tooling, community support, and deployment guides, Llama 3.3 70B is the default choice. It quantises cleanly to Q4_K_M for CPU inference and has hundreds of fine-tunes on Hugging Face.
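As a rough sketch of what that CPU path looks like with llama-cpp-python and a Q4_K_M GGUF (the filename below is a placeholder for whichever quantised file you download from Hugging Face):

```python
# Minimal sketch: CPU inference on a Q4_K_M GGUF with llama-cpp-python.
# The .gguf filename is a placeholder, not an exact release name.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # placeholder filename
    n_ctx=4096,       # context window
    n_threads=16,     # tune to your CPU core count
    n_gpu_layers=0,   # 0 = pure CPU; raise this to offload layers to a GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about open weights."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

Expect single-digit tokens per second on CPU at 70B; the same script speeds up dramatically if you set n_gpu_layers above zero on a machine with any VRAM at all.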
Qwen2.5 72B — the bilingual specialist
- Scores: 71.1% MMLU-Pro, 83.1% MATH, 86.6% HumanEval
- Strength: Superior Chinese-English bilingual capability
- API price: $0.35 input / $0.40 output per 1M tokens
Qwen2.5 72B outscores Llama 3.3 70B on most STEM benchmarks and is the best choice for teams serving Chinese-speaking users. The Qwen ecosystem also has excellent vision and coding variants.
Tier 3 — Small models for edge and single-GPU
Qwen2.5-Coder 32B — best small coder
- Scores: 92.7% HumanEval, 68.4% MMLU-Pro
- Licence: Apache-2.0
- Self-hosting: Fits on 1× A100 40GB or 2× RTX 3090 with 8-bit or 4-bit quantisation (fp16 needs ~64GB)
The best coding model you can self-host on a single GPU. Apache-2.0 licence means zero legal friction for commercial products.
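For single-GPU serving, a hedged sketch with vLLM is below; the AWQ-quantised Hub ID is an assumption (any 4-bit variant that fits your card will do, since the full-precision weights need ~64GB):

```python
# Minimal sketch: offline batch inference with vLLM on a single GPU.
# "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ" is an assumed Hub ID for a
# 4-bit AWQ variant; substitute whichever quantised checkpoint you use.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ", quantization="awq")
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(
    ["Write a Python function that parses an ISO-8601 date string."],
    params,
)
print(outputs[0].outputs[0].text)
```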
Phi-4 — punches above its weight
- Scores: 70.4% MMLU-Pro, 80.4% MATH, 82.6% HumanEval
- Size: 14B — tiny for its capability
- Licence: MIT
- Self-hosting: Runs quantised on 1× RTX 4090 with room to spare (fp16 weights are ~28GB)
Microsoft's 14B model is the efficiency king. It scores higher on MMLU-Pro than Llama 3.3 70B despite being 5× smaller. If you have limited VRAM or need low latency, Phi-4 is extraordinary.
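As an illustration, here is a minimal 4-bit load with transformers + bitsandbytes on a 24GB card; the "microsoft/phi-4" Hub ID is assumed, so check the model card for the exact name:

```python
# Minimal sketch: loading Phi-4 in 4-bit on a single 24GB GPU.
# "microsoft/phi-4" is the assumed Hugging Face Hub ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4",
    quantization_config=quant_cfg,
    device_map="auto",  # place layers on the available GPU
)

inputs = tokenizer("Explain mixture-of-experts in two sentences.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

At 4-bit the weights occupy roughly 7GB, which is what leaves the headroom for long contexts on a 24GB card.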
Licence comparison
| Licence | Commercial use | Modifications | Distribution | Models |
|---|---|---|---|---|
| MIT | ✓ | ✓ | ✓ | DeepSeek R1, Phi-4 |
| Apache-2.0 | ✓ | ✓ | ✓ | Qwen2.5-Coder 32B |
| Llama Community | ✓ (with limits) | ✓ | ✓ (with limits) | Llama 3.1/3.3 |
| DeepSeek License | ✓ | ✓ | ✓ | DeepSeek V3 |
Frequently asked questions
What is the best open-source LLM in 2026?
DeepSeek R1 (MIT licence) is the strongest open-weights model overall with 84.0% MMLU-Pro and 97.3% MATH — rivalling closed frontier models. For practical self-hosting, Llama 3.3 70B has the best ecosystem, and Phi-4 (14B) delivers the most capability per parameter.
Can I use these models commercially?
Yes, with caveats. MIT and Apache-2.0 models (DeepSeek R1, Phi-4, Qwen2.5-Coder) have zero restrictions. Llama models have a 700M-user commercial cap and require compliance with Meta's acceptable-use policy. Always read the licence before shipping a product.
How much GPU memory do I need?
In fp16: Phi-4 needs ~28GB, Qwen2.5-Coder 32B needs ~64GB, Llama 3.3 70B needs ~140GB, DeepSeek V3 needs ~1.3TB. Use 4-bit quantisation (GGUF, AWQ, GPTQ) to cut these by 50–75%.
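Those fp16 figures are just parameter count × 2 bytes. A quick back-of-envelope helper that reproduces them (weights only; the KV cache and activations add more on top, especially at long context):

```python
# Weights-only VRAM estimate: parameter count x bytes per weight.
# KV cache, activations and framework overhead are NOT included.
def weight_vram_gb(params_billion: float, bits_per_weight: int = 16) -> float:
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

for name, params in [("Phi-4", 14), ("Qwen2.5-Coder 32B", 32),
                     ("Llama 3.3 70B", 70), ("DeepSeek V3", 671)]:
    print(f"{name}: ~{weight_vram_gb(params):.0f} GB fp16, "
          f"~{weight_vram_gb(params, 4):.0f} GB at 4-bit")
```

Running it gives 28 / 64 / 140 / 1342 GB at fp16, matching the figures above, and shows why 4-bit quantisation is what makes the 32B and 70B tiers practical on consumer hardware.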
Is an open model better than GPT-5 for my use case?
If you need the absolute highest quality on complex reasoning, GPT-5 still wins. But if you value data privacy, cost predictability, custom fine-tuning, or avoiding vendor lock-in, open models are often the better business decision — especially DeepSeek R1 and V3.
Methodology and sources: see About. Spotted a number that's out of date? Open an issue.