Best LLM for customer support in 2026
Ranked by what actually matters in support: tool-call reliability, latency, refusal rate on legitimate user requests, and per-conversation cost — not just leaderboard score.
OpenRouter routes GPT-5 mini, Claude Sonnet 4, Gemini 2.5 Flash, DeepSeek, Llama and 100+ other LLMs behind a single key — pay-as-you-go, no monthly minimum, automatic fallback if one provider is down (the kind of feature support teams actually need at 3am). Try OpenRouter → (affiliate · supports this site)
What "best for support" actually means
Customer support is not the same problem as coding or general chat. The headline benchmarks (SWE-Bench, MMLU-Pro, Arena Elo) only loosely predict whether a model will be a good support agent. The metrics that matter:
- Tool-call reliability — how often the model produces valid JSON args for your `lookup_order`, `refund`, `escalate` tools. A 5% schema-violation rate at 100k conversations is 5,000 broken flows a month (a schema sketch follows this list).
- Refusal rate on legitimate requests — refusing to help an angry but legitimate customer is worse than helping wrongly. Models with high "I can't help with that" rates on edge cases are a poor fit for support.
- Per-conversation cost — support is output-heavy and high-volume. A frontier model at $15 / $75 vs a fast-tier at $0.25 / $2 is a 60× cost difference at scale.
- Latency / time-to-first-token — support users are already frustrated. Anything over ~1.5s feels broken. Reasoning models (o3, DeepSeek R1) are usually too slow.
- Long-context for policy + history — support agents need to read 5–20 policy docs plus full conversation history. 1M-token context simplifies retrieval architecture.
- Multilingual quality — most support queues are multilingual; the gap between top and middle models on Spanish, German, Japanese, Chinese, and Arabic matters.
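To make "tight schema" concrete, here is a minimal sketch of a support tool definition in the OpenAI function-calling format; the tool name and fields are illustrative placeholders, not a real API contract:

```python
# A hypothetical narrowly-typed tool in the OpenAI function-calling
# format. One tool per concrete action keeps schema violations rare.
LOOKUP_ORDER_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch the status of a single order by its ID.",
        "strict": True,  # ask the API to enforce the schema exactly
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID, e.g. 'ORD-12345'.",
                },
            },
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}
```

You pass this via the `tools` parameter of a chat-completions call; how often a model's generated arguments validate against schemas like this is exactly the reliability metric above.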
TL;DR — best LLMs for customer support, ranked
| Use case | Pick | Why | $ in/out (per 1M) |
|---|---|---|---|
| Default for most teams | GPT-5 mini | Best $/quality, fast | $0.25 / $2 |
| High-stakes / complex tickets | Claude Sonnet 4 | Best tool-call reliability | $3 / $15 |
| Long-context (huge KB) | Gemini 2.5 Flash | 1M ctx, fast, cheap | $0.30 / $2.50 |
| Cheapest production-grade | Gemini 2.0 Flash | $0.001 / conversation | $0.10 / $0.40 |
| On-prem / compliance | Llama 3.3 70B | Self-host on 1× H100 (quantized) | self-hosted |
| Mainland China deployment | Qwen2.5-72B | Best Chinese open model | $0.35 / $0.40 |
| Frontier / "white-glove" support | Claude Opus 4.1 | Highest quality, most expensive | $15 / $75 |
Most production support stacks use a fast-tier model for routing and a frontier model for escalations. OpenRouter lets you do both behind one key with automatic failover. Try OpenRouter → (affiliate)
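For illustration, here is a minimal sketch of that pattern through OpenRouter's OpenAI-compatible endpoint; the model slugs and the `models` fallback list are assumptions to check against OpenRouter's current docs:

```python
# Minimal sketch: calling OpenRouter with a fallback list. Model slugs
# and routing options are illustrative; verify against current docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="openai/gpt-5-mini",  # primary model
    extra_body={
        # OpenRouter tries these in order if the primary errors out
        "models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
    },
    messages=[
        {"role": "system", "content": "You are a support agent for Acme."},
        {"role": "user", "content": "Where is my order ORD-12345?"},
    ],
)
print(resp.choices[0].message.content)
```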
Best LLM for support chatbots — fast, cheap, good enough
For a chatbot that answers FAQs, looks up order status, and routes anything complex to a human, you don't need GPT-5. You need a fast-tier model with reliable tool calls and rock-bottom output cost.
- GPT-5 mini (OpenAI) — 60.5% SWE-Bench, 80.1% MMLU-Pro, $0.25 / $2 per 1M tokens. The best price/quality ratio for support chatbots in 2026. Tool-call accuracy is ~90% on well-typed schemas. ~$0.0027 per 10-turn conversation at typical support length. The default pick.
- Gemini 2.5 Flash (Google) — 53.3% SWE-Bench, 79.0% MMLU-Pro, $0.30 / $2.50, 1M context. Slightly behind GPT-5 mini on tool-calling but the 1M context is a real win when your knowledge base is 200+ docs. Slightly cheaper on cached input.
- Gemini 2.0 Flash — $0.10 / $0.40, 1M context. The cheapest model that still handles structured output correctly most of the time. Right for very high-volume / low-stakes routing.
- Claude 3.5 Haiku — $0.80 / $4. More expensive than the above, but inherits Anthropic's tool-call reliability. Use if you've standardised on Claude family elsewhere.
Best LLM for ticket triage / agent assist
"Agent assist" is the human-in-the-loop pattern: model reads the ticket + history, suggests a draft response, retrieves relevant docs, optionally calls tools to look up account state. Quality matters more here than in pure chatbots — the human will read every output.
- Claude Sonnet 4 (Anthropic) — 72.7% SWE-Bench, 84.0% MMLU-Pro at $3 / $15. The best tool-call reliability outside of Opus, and Anthropic's prose quality means draft responses need less editing. The default for any team where the human reads the suggested reply before sending.
- GPT-5 (OpenAI) — 74.9% SWE-Bench, 86.8% MMLU-Pro at $1.25 / $10. Cheaper than Sonnet 4 with slightly higher knowledge benchmarks. Trade-off: slightly worse on long, multi-step tool flows where state has to be tracked across many turns.
- Gemini 2.5 Pro (Google) — 63.8% SWE-Bench at $1.25 / $10 with 2M context. Right when your tickets reference long contracts, full email threads, or multi-month conversation histories.
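A minimal sketch of the agent-assist flow with the Anthropic SDK follows; the model ID, prompts, and placeholder ticket variables are illustrative, not production code:

```python
# Sketch of the agent-assist pattern: the model drafts, a human reviews.
# Model ID and prompts are assumptions; check current Anthropic docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

ticket_text = "Customer says the refund for order ORD-12345 never arrived."
policy_chunks = "Refunds post within 5-10 business days of approval..."

draft = client.messages.create(
    model="claude-sonnet-4-20250514",  # verify the current model ID
    max_tokens=500,
    system=(
        "Draft a reply for a human support agent to review before sending. "
        "Cite the policy snippet you relied on. Never promise a refund."
    ),
    messages=[{
        "role": "user",
        "content": f"Ticket:\n{ticket_text}\n\nRelevant policy:\n{policy_chunks}",
    }],
)
print(draft.content[0].text)  # shown to the agent, never sent directly
```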
Best LLM for voice agents / IVR replacement
Voice support has its own constraints: latency budgets are tight (anything under ~1s feels responsive), and the model output is read aloud — verbosity is a UX problem, not just a cost one. (A TTFT measurement sketch follows the picks below.)
- Gemini 2.5 Flash — fastest first-token latency we measured among capable models, native streaming, $0.30 / $2.50. The default for production voice agents.
- GPT-5 mini — ~10–15% slower TTFT than Gemini Flash but better reasoning quality on edge cases. Right when your voice agent occasionally needs to handle complex disputes.
- GPT-4o mini — $0.15 / $0.60. Older but lower latency and integrates with OpenAI's Realtime voice API natively.
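If you want to verify latency claims on your own stack, a back-of-envelope TTFT measurement is a few lines against any OpenAI-compatible streaming endpoint; the model name here is a placeholder:

```python
# Rough time-to-first-token measurement over a streaming completion.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-5-mini",  # swap in the model under test
    messages=[{"role": "user", "content": "My package is late, what can I do?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.2f}s")
        break
```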
Best on-prem / open-weights LLMs for support
HIPAA, banking, government, EU data-residency rules — sometimes you can't send conversations to a US-hosted API. Open-weights models close most of the quality gap if you have the GPU budget.
- Llama 3.3 70B (Llama community licence) — fits on a single H100 with 8-bit quantization (fp16 weights alone are ~140 GB, so full precision needs 2× H100). Strong general-purpose performance, well-documented support tooling, integrates with vLLM / TGI / SGLang. The practical default for self-hosted support deployments (query sketch after this list).
- Qwen2.5-72B (Qwen licence, open weights) — 71.1% MMLU-Pro at $0.35 / $0.40 hosted. Strongest open Chinese-language model — the right pick for support teams in mainland China where Western APIs aren't reachable.
- DeepSeek V3 (DeepSeek licence, open weights) — $0.27 / $1.10 hosted. 671B-parameter MoE with strong general-purpose capability. Fast on paid inference; expensive to self-host (needs an 8× H100-class node at minimum).
- Qwen2.5-Coder 32B (Apache-2.0) — fits on a single A100. The right choice if your support agents run a lot of code-aware lookups (developer support, infra alerts).
- Phi-4 (MIT licence) — 70.4% MMLU-Pro at $0.07 / $0.14 hosted. 16k context is the constraint; right for short tickets / FAQ routing where you don't need to ingest long policy docs.
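One upside of self-hosting worth noting: vLLM (and TGI/SGLang) expose OpenAI-compatible endpoints, so the client side looks like any hosted API. A sketch, assuming a local vLLM server; the serve command, flags, and model path are illustrative and should be sized to your hardware:

```python
# Client-side sketch for a self-hosted deployment. Assumes a vLLM
# OpenAI-compatible server is already running, started with something
# like: vllm serve meta-llama/Llama-3.3-70B-Instruct --quantization fp8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a support agent. Customer data never leaves this machine."},
        {"role": "user", "content": "I need to update my billing address."},
    ],
)
print(resp.choices[0].message.content)
```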
Frontier picks — when "good enough" isn't
For B2B enterprise support, post-incident analysis, or legal/medical-adjacent customer ops where one bad answer is catastrophic, the frontier tier is justified despite the cost.
- Claude Opus 4.1 — 74.5% SWE-Bench, 87.0% MMLU-Pro at $15 / $75. Highest tool-call reliability, best long-form prose, lowest hallucination rate on policy questions. ~$0.13 per 10-turn conversation — only worth it for white-glove enterprise support.
- GPT-5 — far cheaper than Opus 4.1 ($1.25 / $10 vs $15 / $75) and 0.4 points higher on SWE-Bench. The cost-effective frontier pick.
- Grok 4 — lowest refusal rate at the frontier. Useful for support workloads in industries (gambling, adult content moderation, firearms retail) where Anthropic's and OpenAI's refusal rates create false-positive blocks on legitimate queries.
What to avoid for support
- Reasoning models (o3, DeepSeek R1) for live chat — first-token latency is 5–15 seconds. Right for offline ticket analysis, wrong for real-time support.
- 1.5-generation models (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) at full price — they're not meaningfully cheaper than current-gen options and hallucinate more on tool args. The newer fast-tier models are strictly better for support.
- Models without native function-calling — and verify the format of those that have it: Phi-4 supports it, but in a non-OpenAI-compatible format, so check before integrating.
- Locked-in voice solutions — bundled "all-in-one" support voice products lock you into their model. Build the LLM layer separately so you can swap when prices drop (which they have, every quarter, since 2024).
Cost reality check — per-conversation math
Assumptions: 10-turn conversation, ~6,000 input tokens (system prompt + retrieved KB chunks + conversation history), ~600 output tokens, no prompt caching. Numbers are rounded.
| Model | Per conversation | Per 100k conversations / month |
|---|---|---|
| Gemini 2.0 Flash | $0.0009 | $90 |
| GPT-4o mini | $0.0014 | $140 |
| DeepSeek V3 (hosted) | $0.0023 | $230 |
| GPT-5 mini | $0.0027 | $270 |
| Gemini 2.5 Flash | $0.0033 | $330 |
| Claude 3.5 Haiku | $0.0072 | $720 |
| GPT-5 | $0.0135 | $1,350 |
| Claude Sonnet 4 | $0.0270 | $2,700 |
| Claude Opus 4.1 | $0.1350 | $13,500 |
Run your own numbers with the API cost calculator — this is a back-of-envelope; real costs depend on your input/output ratio, prompt-caching strategy, and tool-call retry rate.
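The underlying arithmetic is simple enough to inline in your own tooling; this sketch reproduces the table's assumptions:

```python
# Per-conversation cost: 6,000 input tokens + 600 output tokens,
# prices quoted in $ per 1M tokens (same assumptions as the table).
def cost_per_conversation(in_price: float, out_price: float,
                          in_tokens: int = 6_000,
                          out_tokens: int = 600) -> float:
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

print(cost_per_conversation(0.25, 2.00))    # GPT-5 mini -> 0.0027
print(cost_per_conversation(3.00, 15.00))   # Sonnet 4   -> 0.027
print(cost_per_conversation(3.00, 15.00) * 100_000)  # 100k/mo -> 2700.0
```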
Implementation checklist
- Build a retrieval layer first. Even GPT-5 hallucinates policy answers without grounding. Vector store + structured policy chunks is non-negotiable.
- Define your tools narrowly. One tool per concrete action (`lookup_order(id)`, not `do_account_thing`). Schema violation rates drop ~5× on tight schemas.
- Set a hard escalation rule. "If model uncertainty is X or refusal triggered, hand off to human." Don't let the model improvise around its own limits.
- Run a fast-tier + frontier mix. Route simple intents to GPT-5 mini / Gemini Flash; escalate complex multi-step tickets to Sonnet 4 or Opus 4.1. Most teams overspend by ~3× by sending everything to one tier (see the routing sketch after this list).
- Cache aggressively. System prompts and policy docs barely change. Anthropic prompt caching cuts repeated-input costs ~10×; Gemini's implicit cache is automatic; OpenAI's prompt caching (added in 2024) discounts repeated prefixes automatically.
- Log refusal triggers. Track which queries the model declines. ~30% of refusals on support workloads are false positives — surface them, don't hide them.
- Multi-provider failover from day 1. Provider outages happen. OpenRouter gives you automatic failover across 100+ models behind one key.
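Here is a minimal sketch of the fast-tier + frontier routing from the checklist, using OpenRouter slugs as placeholders; the intent labels and the decision rule are yours to define:

```python
# Two-tier routing sketch: a cheap model classifies intent, the
# frontier model only sees escalations. Names are illustrative.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

FAST_TIER = "openai/gpt-5-mini"
FRONTIER = "anthropic/claude-sonnet-4"
SIMPLE_INTENTS = {"order_status", "faq", "password_reset"}

def classify_intent(message: str) -> str:
    resp = client.chat.completions.create(
        model=FAST_TIER,
        messages=[
            {"role": "system", "content":
             "Classify the support message as one of: order_status, faq, "
             "password_reset, billing_dispute, other. Reply with the label only."},
            {"role": "user", "content": message},
        ],
    )
    return resp.choices[0].message.content.strip()

def answer(message: str) -> str:
    # Simple intents stay on the fast tier; everything else escalates.
    model = FAST_TIER if classify_intent(message) in SIMPLE_INTENTS else FRONTIER
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
    )
    return resp.choices[0].message.content
```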
Frequently asked questions
What is the best LLM for customer support in 2026?
For most teams the right default is GPT-5 mini at $0.25 / $2 per 1M tokens — 60.5% SWE-Bench (good enough for tool-calling), 80.1% MMLU-Pro for knowledge lookups, ~12× cheaper than Claude Sonnet 4. Use Claude Sonnet 4 ($3 / $15) when you need the most reliable tool-call schemas on complex multi-step ticket workflows, and Gemini 2.5 Flash ($0.30 / $2.50) when you need 1M-token context to feed in long policy documents.
What is the cheapest LLM that's good enough for customer support?
Gemini 2.0 Flash at $0.10 / $0.40 per 1M tokens is the cheapest production-grade option with 1M context. Phi-4 (MIT licence) at $0.07 / $0.14 hosted is even cheaper but has a tighter 16k context. For knowledge-base chatbots that mostly route, summarise, and answer simple FAQs, both deliver acceptable quality at <$0.001 per typical conversation.
Which LLM has the best tool-calling for customer support agents?
Claude Sonnet 4 and Claude Opus 4.1 have the lowest function-call schema-violation rate in our tests, which translates directly to fewer retries and broken multi-turn flows. GPT-5 is statistically tied on capability but produces malformed args ~2-3× more often on edge cases. For mission-critical ticket-resolution agents, the Claude family is the safe default.
Can I run a customer support LLM on-prem?
Yes — for compliance-sensitive workloads (HIPAA, banking, government) Llama 3.3 70B (Llama community licence) runs on a single H100 with 8-bit quantization (2× H100 for fp16) and gives you a self-hosted support model. Qwen2.5-72B (Qwen licence) is the strongest open Chinese model and the practical default for support teams operating in mainland China.
How much does an LLM customer support agent cost per conversation?
For a typical 10-turn support conversation with ~6k input tokens and ~600 output: GPT-5 mini ≈ $0.0027, Gemini 2.5 Flash ≈ $0.0033, Claude Sonnet 4 ≈ $0.027, GPT-5 ≈ $0.014, Gemini 2.0 Flash ≈ $0.0009. At 100k conversations/month, the difference between Sonnet 4 ($2,700) and GPT-5 mini ($270) is real money.
Should I use a reasoning model like o3 or DeepSeek R1 for support?
No, not for live chat. Reasoning models have 5–15 second first-token latency, which feels broken in a chat UI. They're useful for offline support tasks: post-conversation summarisation, complex root-cause analysis, training-data quality review. For real-time support stick with non-reasoning models.
How do I prevent the LLM from refusing legitimate customer queries?
(1) Use a model with a lower baseline refusal rate — Grok 4 is the lowest at the frontier; GPT-5 mini is significantly more permissive than Claude on edge cases. (2) Make the role explicit in the system prompt: "You are a support agent; the user is an authenticated customer of [company]." (3) Provide an explicit allow-list of topics you support. (4) Always have human handoff as the fallback rather than letting the model invent boundaries. A minimal prompt sketch is below.
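A minimal system-prompt sketch combining (2) and (3); the company name and allow-list are placeholders:

```python
# Illustrative system-prompt framing to reduce false-positive refusals.
SYSTEM_PROMPT = """You are a customer support agent for Acme Inc.
The user is an authenticated Acme customer.

You help with: orders, shipping, returns, billing, account access.
If a request is outside that list, or you are unsure, say so briefly
and hand off to a human agent. Do not invent additional restrictions."""
```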
Methodology and sources: see About. Spotted a number that's out of date? Open an issue.