Best LLMs — 2026 Rankings
The definitive ranking of every major LLM — open and closed source — compared across reasoning, coding, math, agentic, software engineering, and chat benchmarks.
Roshan Desai · Last updated: 2026-03-12
Tier rankings (parameter counts in parentheses; N/A = undisclosed):

S Tier: Claude Opus 4.6 (N/A) · GPT-5.4 (N/A) · GLM-5 (744B) · Kimi K2.5 (1T) · DeepSeek V3.2 (685B)
A Tier: Claude Sonnet 4.6 (N/A) · Gemini 3.1 Pro (N/A) · Qwen 3.5 (397B) · DeepSeek R1 (671B) · Mistral Large (675B) · MiniMax M2.5 (230B) · Step-3.5-Flash (196B) · MiMo-V2-Flash (309B)
B Tier: GPT-oss 120B (117B) · Nemotron Ultra 253B (253B)
C Tier: Grok 3 (N/A) · DeepSeek V3 (671B) · Llama 4 Maverick (400B)
D Tier: (none)
Best LLMs by Task — Benchmark Rankings
Which LLM is best for coding, reasoning, or agentic tasks? See how every model stacks up across key benchmarks.
Best Overall: general knowledge across 57 subjects (MMLU)
Best Multilingual: multilingual Q&A across languages (MMMLU)
Best Visual Reasoning: visual reasoning across disciplines (MMMU-Pro)
Hardest Exam: expert-level multidisciplinary reasoning (Humanity's Last Exam)
LLM Benchmark Scores & Pricing
Complete benchmark results and pricing for every major LLM.
Columns: Model (Org) | Params | Context | Input price | Output price | 18 benchmark scores. Among the benchmark columns, the 2nd is GPQA Diamond, the 4th Chatbot Arena (Elo), the 12th MMMU-Pro, the 14th Terminal-Bench 2.0, the 17th OSWorld, and the 18th BrowseComp.
Claude Opus 4.6 Anthropic | N/A | 200K | $15.00 | $75.00 | 82.0 | 91.3 | 94.0 | 1503 | 80.8 | 95.0 | 76.0 | 100.0 | 97.6 | 91.0 | 91.1 | 77.3 | 53.0 | 65.4 | 68.8 | 91.9 | 72.7 | 84.0 |
Claude Sonnet 4.6 Anthropic | N/A | 200K | $3.00 | $15.00 | 79.1 | 89.9 | N/A | 1460 | 79.6 | 92.1 | 72.4 | 52.8 | 97.8 | N/A | 89.3 | 75.6 | 49.0 | 59.1 | 58.3 | 91.7 | 72.5 | 74.7 |
DeepSeek R1 DeepSeek | 671B | 128K | $0.28 | $0.42 | 84.0 | 71.5 | 83.3 | 1398 | 49.2 | 90.2 | 65.9 | 87.5 | 97.3 | 90.8 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
DeepSeek V3 DeepSeek | 671B | 128K | $0.28 | $1.10 | 81.2 | 68.4 | N/A | 1359 | 38.8 | N/A | 49.2 | N/A | 94.0 | 88.5 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
DeepSeek V3.2 DeepSeek | 685B | 130K | $0.28 | $0.42 | 85.0 | 79.9 | N/A | 1423 | 67.8 | N/A | 74.1 | 89.3 | N/A | 88.5 | N/A | N/A | N/A | 39.6 | N/A | N/A | N/A | N/A |
Gemini 3.1 Pro Google | N/A | 1M | $2.00 | $12.00 | 85.0 | 91.9 | 85.0 | 1492 | 78.0 | 93.0 | 81.3 | 100.0 | 94.0 | 91.8 | 91.8 | 81.0 | 45.8 | 56.2 | 31.1 | 85.3 | N/A | 59.2 |
GLM-5 Zhipu AI | 744B | 200K | N/A | N/A | 70.4 | 86.0 | 88.0 | 1454 | 77.8 | 90.0 | 52.0 | 84.0 | 88.0 | 85.0 | N/A | N/A | 50.4 | 56.2 | N/A | 89.7 | N/A | 75.9 |
GPT-5.4 OpenAI | N/A | 1M | $2.50 | $15.00 | N/A | 92.8 | N/A | 1463 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 81.2 | N/A | 75.1 | N/A | N/A | 75.0 | 82.7 |
GPT-oss 120B OpenAI | 117B | 128K | N/A | N/A | 90.0 | 80.9 | N/A | 1355 | 62.4 | 88.3 | 60.0 | 97.9 | N/A | 90.0 | N/A | N/A | N/A | 18.7 | N/A | N/A | N/A | N/A |
Grok 3 xAI | N/A | 131K | $3.00 | $15.00 | N/A | 84.6 | N/A | 1412 | 49.0 | 94.5 | 79.4 | 93.3 | N/A | N/A | N/A | N/A | N/A | 52.0 | N/A | N/A | N/A | N/A |
Kimi K2.5 Moonshot | 1T | 262K | N/A | N/A | 87.1 | 87.6 | 94.0 | 1438 | 76.8 | 99.0 | 85.0 | 96.1 | 98.0 | 92.0 | N/A | 78.5 | N/A | 50.8 | N/A | N/A | N/A | N/A |
Llama 4 Maverick Meta | 400B | 1M | N/A | N/A | 80.5 | 69.8 | N/A | 1328 | N/A | 62.0 | 43.4 | N/A | N/A | 85.5 | 84.6 | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
MiMo-V2-Flash Xiaomi | 309B | 262K | N/A | N/A | 84.9 | 83.7 | N/A | 1393 | 73.4 | 84.8 | 80.6 | 94.1 | N/A | 86.7 | N/A | N/A | N/A | 38.5 | N/A | N/A | N/A | N/A |
MiniMax M2.5 MiniMax | 230B | 205K | $0.30 | $1.20 | 76.5 | 85.2 | 87.5 | 1404 | 80.2 | 89.6 | 65.0 | 86.3 | N/A | 85.0 | N/A | N/A | N/A | 42.2 | N/A | N/A | N/A | N/A |
Mistral Large Mistral | 675B | 256K | N/A | N/A | N/A | 43.9 | N/A | 1416 | N/A | 92.0 | 82.8 | 88.0 | 93.6 | 85.5 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
Nemotron Ultra 253B Nvidia | 253B | 128K | N/A | N/A | N/A | 76.0 | 89.5 | 1348 | N/A | N/A | 66.3 | 72.5 | 97.0 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
Qwen 3.5 Qwen | 397B | 262K | N/A | N/A | 87.8 | 88.4 | 92.6 | 1450 | 76.4 | N/A | 83.6 | N/A | N/A | 88.5 | 88.5 | 79.0 | 28.7 | 52.5 | N/A | 86.7 | 62.2 | 78.6 |
Step-3.5-Flash Stepfun | 196B | 262K | $0.10 | $0.30 | 85.8 | N/A | N/A | 1389 | 74.4 | 81.1 | 86.4 | 99.8 | N/A | N/A | N/A | N/A | N/A | 51.0 | N/A | N/A | N/A | N/A |
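Ranking the table by any one benchmark is a simple sort over rows, with missing (N/A) scores pushed to the bottom. A minimal sketch, using a handful of GPQA Diamond and Arena scores taken from the table above (the `rank_by` helper is illustrative, not part of any library):

```python
# A few rows from the table above: (model, gpqa_diamond, arena_elo).
# N/A scores are represented as None and sorted last.
rows = [
    ("Claude Opus 4.6", 91.3, 1503),
    ("GPT-5.4",         92.8, 1463),
    ("Gemini 3.1 Pro",  91.9, 1492),
    ("Kimi K2.5",       87.6, 1438),
    ("Step-3.5-Flash",  None, 1389),
]

def rank_by(rows, idx):
    """Sort models by column idx, highest score first; None (N/A) last."""
    return sorted(rows, key=lambda r: (r[idx] is None, -(r[idx] or 0)))

for model, gpqa, _ in rank_by(rows, 1):
    print(model, gpqa)
```

Sorting on a `(is_missing, -score)` key keeps the comparison purely numeric while guaranteeing that unreported scores never outrank real ones.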
Compare LLMs Head-to-Head
Select two models to see how they stack up across all benchmarks.
Benchmark | GPT-5.4 | Claude Opus 4.6
GPQA Diamond | 92.8 | 91.3
Chatbot Arena | 1463 | 1503
MMMU-Pro | 81.2 | 77.3
Terminal-Bench 2.0 | 75.1 | 65.4
OSWorld | 75.0 | 72.7
BrowseComp | 82.7 | 84.0
Benchmarks won | 4 | 2
Try These Models in Onyx
Onyx is the open-source AI platform that lets you connect any of these LLMs to your team's docs, apps, and people.