
The best LLM for coding in 2026

Ranked by SWE-Bench Verified (real GitHub issue resolution) and HumanEval, with current API prices and context windows. No vendor bias — just the numbers.

Try every model in this guide from one API key.

OpenRouter routes GPT-5, Claude, Gemini, DeepSeek, Llama, Qwen and 100+ other LLMs behind a single key — pay-as-you-go, no monthly minimum, transparent per-token pricing. Try OpenRouter → (affiliate · supports this site)
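If you go the single-key route, a call is just a standard OpenAI-style chat completion pointed at OpenRouter's endpoint. A minimal sketch, assuming the OpenAI Python SDK and an OPENROUTER_API_KEY environment variable; the model slugs are illustrative, so check OpenRouter's model list for the exact identifiers:

```python
# Minimal sketch: two different models, one key, via OpenRouter's
# OpenAI-compatible endpoint. Assumes `pip install openai` and an
# OPENROUTER_API_KEY environment variable; the model slugs are illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

for model in ("openai/gpt-5", "anthropic/claude-sonnet-4"):  # illustrative slugs
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Refactor this loop into a list comprehension: ..."}],
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
```

Swapping models is then a one-string change, which makes it easy to benchmark the picks below against your own codebase.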

TL;DR — pick by use case

| Use case | Best pick | SWE-Bench | $ in / out (per 1M) |
|---|---|---|---|
| Autonomous coding agent | GPT-5 · Claude Opus 4.1 | 74.9 / 74.5 | $1.25 / $10 · $15 / $75 |
| Daily IDE pair programmer | Claude Sonnet 4 | 72.7 | $3 / $15 |
| High-volume backend / batch | GPT-5 mini · Gemini 2.5 Flash | 60.5 / 53.3 | $0.25 / $2 · $0.30 / $2.50 |
| Open-weights / on-prem | DeepSeek R1 · Qwen2.5-Coder 32B | 49.2 / — | $0.55 / $2.19 · $0.18 flat |
| IDE autocomplete (FIM) | Codestral 25.01 | — | $0.30 / $0.90 |

How we rank coding ability

Two benchmarks, weighted differently: SWE-Bench Verified (real GitHub issue resolution, weighted most heavily) and HumanEval (function-level code generation, used as a sanity check).

If you only have time to look at one number, look at SWE-Bench. A model with 70%+ SWE-Bench can drive a coding agent end-to-end; a model below 50% will need significant scaffolding and human review.

Frontier tier · 70%+ on SWE-Bench

Five models clear 70% on SWE-Bench in early 2026, and the top four are statistically tied. Treat differences below ~3 percentage points as noise.

  1. GPT-5 — 74.9% SWE-Bench, 95.1% HumanEval. The most consistent end-to-end agentic performance. $1.25 input / $10 output per 1M tokens, 400k context. The default if you have OpenAI access.
  2. Claude Opus 4.1 — 74.5% SWE-Bench, 95.4% HumanEval. Marginally better at long-horizon refactors and multi-file edits in our testing. Expensive ($15 in / $75 out) — reserve it for the hard work and fall back to Sonnet for the rest (see the escalation sketch after this list).
  3. Claude Sonnet 4 — 72.7% SWE-Bench, $3 / $15. The best price/performance ratio at the frontier. This is what most teams should default to in an IDE plugin.
  4. Grok 4 — 72.0% SWE-Bench. Strong reasoning, fast inference. $3 / $15 per 1M tokens.
  5. OpenAI o3 — 71.7% SWE-Bench. Reasoning model — emits long chains of thought, slower but more reliable on hard bugs. $2 / $8 per 1M tokens.
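A pattern worth automating from the Opus entry above: send everything to Sonnet 4 first and pay for Opus 4.1 only when the cheap attempt fails your checks. A minimal sketch, assuming the Anthropic Python SDK; the model IDs and the passes_tests() hook are placeholders, not the exact strings Anthropic ships:

```python
# Sketch of cost-tiered escalation: try Claude Sonnet 4 first, pay for
# Claude Opus 4.1 only when the cheap attempt fails validation.
# Assumes `pip install anthropic` and an ANTHROPIC_API_KEY environment
# variable; model IDs and passes_tests() are placeholders.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CHEAP, EXPENSIVE = "claude-sonnet-4", "claude-opus-4-1"  # placeholder IDs


def passes_tests(patch: str) -> bool:
    """Placeholder: run your real test suite / linter against the patch."""
    return "def " in patch  # stand-in check only


def fix_issue(issue: str) -> str:
    patch = ""
    for model in (CHEAP, EXPENSIVE):
        msg = client.messages.create(
            model=model,
            max_tokens=4096,
            messages=[{"role": "user", "content": f"Write a patch for this issue:\n{issue}"}],
        )
        patch = msg.content[0].text
        if passes_tests(patch):
            return patch  # cheap model was good enough; stop here
    return patch  # Opus's attempt; flag for human review if it also failed
```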

Mid tier · 50–70% on SWE-Bench

  1. Gemini 2.5 Pro — 63.8% SWE-Bench. Its 2M-token context is a unique advantage if you want the model to read an entire monorepo at once (see the sketch after this list).
  2. Claude 3.7 Sonnet — 62.3% SWE-Bench. Hybrid extended-thinking mode is useful for tricky bugs.
  3. GPT-5 mini — 60.5% SWE-Bench at 1/5 the price of full GPT-5. Excellent for high-volume bot traffic.
  4. GPT-4.1 — 54.6% SWE-Bench, 1M context. A workhorse if you need long context plus moderate price.
  5. Gemini 2.5 Flash — 53.3% SWE-Bench at $0.30 / $2.50. Best $/quality ratio in the tier.
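On the long-context point from the Gemini 2.5 Pro entry, the mechanics are just careful concatenation under a token budget. A rough sketch; the 4-characters-per-token estimate and the file filter are assumptions you would tune for your own repo:

```python
# Rough sketch: flatten a repo into one long-context prompt, with a character
# budget standing in for tokens (~4 chars/token is an estimate, not a count).
from pathlib import Path

def repo_to_prompt(root: str, max_tokens: int = 1_500_000,
                   exts: tuple[str, ...] = (".py", ".ts", ".go", ".md")) -> str:
    budget = max_tokens * 4              # crude chars-per-token heuristic
    chunks: list[str] = []
    used = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in exts:
            continue
        block = f"\n===== {path} =====\n{path.read_text(errors='ignore')}"
        if used + len(block) > budget:   # stop before blowing the context window
            break
        chunks.append(block)
        used += len(block)
    return "".join(chunks)

# Usage: prompt = repo_to_prompt("path/to/monorepo") + "\n\nWhere is auth handled?"
```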

Open weights — best for self-hosting

If you need on-prem or want to avoid lock-in, two clear winners:

  1. DeepSeek R1 — 49.2% SWE-Bench, MIT-licensed, $0.55 / $2.19 per 1M tokens hosted. The strongest open-weights coding model overall.
  2. Qwen2.5-Coder 32B — $0.18 flat per 1M tokens. The best small open coder for self-hosting on a single GPU.

Cheap workhorses · for backends and bots

If most of your traffic is high-volume backend or bot requests, GPT-5 mini ($0.25 / $2), Gemini 2.5 Flash ($0.30 / $2.50), and DeepSeek V3 ($0.27 / $1.10) handle the bulk of it at a fraction of frontier cost.

What about IDE autocomplete specifically?

For sub-200ms fill-in-the-middle (FIM) latency in a Cursor/Copilot-style integration, the model architecture matters more than headline benchmark scores. Codestral 25.01 is purpose-built for this — 256k context, $0.30 / $0.90 pricing, and FIM is a first-class citizen. Qwen2.5-Coder is the best self-hostable alternative.
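A FIM call sends a prefix and a suffix and gets back the code that belongs in between. A minimal sketch against Codestral, assuming Mistral's Python SDK; the fim.complete() method and the codestral-latest model name follow Mistral's documented API, but verify them against the current SDK before relying on this:

```python
# Sketch of a fill-in-the-middle (FIM) request: the model completes the span
# between `prompt` (the prefix) and `suffix`. Assumes `pip install mistralai`
# and a MISTRAL_API_KEY environment variable; verify fim.complete() and the
# model name against the current SDK docs.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

prefix = "def median(values: list[float]) -> float:\n    ordered = sorted(values)\n"
suffix = "\n    return result\n"

resp = client.fim.complete(
    model="codestral-latest",
    prompt=prefix,
    suffix=suffix,
    max_tokens=128,
)
print(resp.choices[0].message.content)  # only the middle the model filled in
```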

Frequently asked questions

What's the best LLM for coding in 2026?

On SWE-Bench Verified (the hardest public coding benchmark), GPT-5 (74.9%) and Claude Opus 4.1 (74.5%) are statistically tied at the top. For most teams, Claude Sonnet 4 hits the best price/performance sweet spot at $3 / $15 per 1M tokens with 72.7% SWE-Bench.

What is the cheapest model that is still good at code?

DeepSeek V3 ($0.27 / $1.10) and Qwen2.5-Coder 32B ($0.18 flat) deliver 90%+ HumanEval at a fraction of frontier-model cost. For pure IDE autocomplete, Codestral 25.01 is purpose-built and cheap.

Is HumanEval still a useful benchmark?

HumanEval is saturated — almost every frontier model scores above 90%, so it no longer differentiates well. Treat it as a sanity check (anything below 85% is a red flag) and rely on SWE-Bench Verified for separating top models.

What about open-weights models for coding?

DeepSeek R1 (49.2% SWE-Bench, MIT-licensed) is the strongest open-weights coding model. Qwen2.5-Coder 32B is the best small open coder for self-hosting on a single GPU.

Should I use a reasoning model (o3, R1) or a regular model for coding?

Reasoning models trade latency for reliability on hard bugs. For interactive IDE work where you want sub-2s responses, stick with regular models (Sonnet 4, GPT-5 mini, Gemini Flash). For overnight agentic work where the model fixes 50 issues unattended, reasoning models earn their cost back in fewer wrong patches.


Methodology and sources: see About. Spotted a number that's out of date? Open an issue.
