Best LLMs — 2026 Rankings

LLM Leaderboard

The definitive ranking of every major LLM — open and closed source — compared across reasoning, coding, math, agentic, software engineering, and chat benchmarks.

Roshan Desai · Last updated: 2026-03-12

Tiers, with each model's parameter count in parentheses:

S: Claude Opus 4.6 (N/A), GPT-5.4 (N/A), GLM-5 (744B), Kimi K2.5 (1T), DeepSeek V3.2 (685B)
A: Claude Sonnet 4.6 (N/A), Gemini 3.1 Pro (N/A), Qwen 3.5 (397B), DeepSeek R1 (671B), Mistral Large (675B), MiniMax M2.5 (230B), Step-3.5-Flash (196B), MiMo-V2-Flash (309B)
B: GPT-oss 120B (117B), Nemotron Ultra 253B (253B)
C: Grok 3 (N/A), DeepSeek V3 (671B), Llama 4 Maverick (400B)
D: (none)

Best LLMs by Task — Benchmark Rankings

Which LLM is best for coding, reasoning, or agentic tasks? See how every model stacks up across key benchmarks.

Best Overall: general knowledge across 57 subjects (MMLU)
Best Multilingual: multilingual Q&A across languages (MMMLU)
Best Visual Reasoning: visual reasoning across disciplines (MMMU-Pro)
Hardest Exam: expert-level multidisciplinary reasoning (Humanity's Last Exam)

LLM Benchmark Scores & Pricing

Complete benchmark results and pricing for every major LLM.

Model | Provider | Params | Context | Input price | Output price | Bench 1 | GPQA Diamond | Bench 3 | Chatbot Arena | Bench 5 | Bench 6 | Bench 7 | Bench 8 | Bench 9 | Bench 10 | Bench 11 | MMMU-Pro | Bench 13 | Terminal-Bench 2.0 | Bench 15 | Bench 16 | OSWorld | BrowseComp
Claude Opus 4.6 | Anthropic | N/A | 200K | $15.00 | $75.00 | 82.0 | 91.3 | 94.0 | 1503 | 80.8 | 95.0 | 76.0 | 100.0 | 97.6 | 91.0 | 91.1 | 77.3 | 53.0 | 65.4 | 68.8 | 91.9 | 72.7 | 84.0
Claude Sonnet 4.6 | Anthropic | N/A | 200K | $3.00 | $15.00 | 79.1 | 89.9 | N/A | 1460 | 79.6 | 92.1 | 72.4 | 52.8 | 97.8 | N/A | 89.3 | 75.6 | 49.0 | 59.1 | 58.3 | 91.7 | 72.5 | 74.7
DeepSeek R1 | DeepSeek | 671B | 128K | $0.28 | $0.42 | 84.0 | 71.5 | 83.3 | 1398 | 49.2 | 90.2 | 65.9 | 87.5 | 97.3 | 90.8 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A
DeepSeek V3 | DeepSeek | 671B | 128K | $0.28 | $1.10 | 81.2 | 68.4 | N/A | 1359 | 38.8 | N/A | 49.2 | N/A | 94.0 | 88.5 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A
DeepSeek V3.2 | DeepSeek | 685B | 130K | $0.28 | $0.42 | 85.0 | 79.9 | N/A | 1423 | 67.8 | N/A | 74.1 | 89.3 | N/A | 88.5 | N/A | N/A | N/A | 39.6 | N/A | N/A | N/A | N/A
Gemini 3.1 Pro | Google | N/A | 1M | $2.00 | $12.00 | 85.0 | 91.9 | 85.0 | 1492 | 78.0 | 93.0 | 81.3 | 100.0 | 94.0 | 91.8 | 91.8 | 81.0 | 45.8 | 56.2 | 31.1 | 85.3 | N/A | 59.2
GLM-5 | Zhipu AI | 744B | 200K | N/A | N/A | 70.4 | 86.0 | 88.0 | 1454 | 77.8 | 90.0 | 52.0 | 84.0 | 88.0 | 85.0 | N/A | N/A | 50.4 | 56.2 | N/A | 89.7 | N/A | 75.9
GPT-5.4 | OpenAI | N/A | 1M | $2.50 | $15.00 | N/A | 92.8 | N/A | 1463 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 81.2 | N/A | 75.1 | N/A | N/A | 75.0 | 82.7
GPT-oss 120B | OpenAI | 117B | 128K | N/A | N/A | 90.0 | 80.9 | N/A | 1355 | 62.4 | 88.3 | 60.0 | 97.9 | N/A | 90.0 | N/A | N/A | N/A | 18.7 | N/A | N/A | N/A | N/A
Grok 3 | xAI | N/A | 131K | $3.00 | $15.00 | N/A | 84.6 | N/A | 1412 | 49.0 | 94.5 | 79.4 | 93.3 | N/A | N/A | N/A | N/A | N/A | 52.0 | N/A | N/A | N/A | N/A
Kimi K2.5 | Moonshot | 1T | 262K | N/A | N/A | 87.1 | 87.6 | 94.0 | 1438 | 76.8 | 99.0 | 85.0 | 96.1 | 98.0 | 92.0 | N/A | 78.5 | N/A | 50.8 | N/A | N/A | N/A | N/A
Llama 4 Maverick | Meta | 400B | 1M | N/A | N/A | 80.5 | 69.8 | N/A | 1328 | N/A | 62.0 | 43.4 | N/A | N/A | 85.5 | 84.6 | N/A | N/A | N/A | N/A | N/A | N/A | N/A
MiMo-V2-Flash | Xiaomi | 309B | 262K | N/A | N/A | 84.9 | 83.7 | N/A | 1393 | 73.4 | 84.8 | 80.6 | 94.1 | N/A | 86.7 | N/A | N/A | N/A | 38.5 | N/A | N/A | N/A | N/A
MiniMax M2.5 | MiniMax | 230B | 205K | $0.30 | $1.20 | 76.5 | 85.2 | 87.5 | 1404 | 80.2 | 89.6 | 65.0 | 86.3 | N/A | 85.0 | N/A | N/A | N/A | 42.2 | N/A | N/A | N/A | N/A
Mistral Large | Mistral | 675B | 256K | N/A | N/A | N/A | 43.9 | N/A | 1416 | N/A | 92.0 | 82.8 | 88.0 | 93.6 | 85.5 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A
Nemotron Ultra 253B | Nvidia | 253B | 128K | N/A | N/A | N/A | 76.0 | 89.5 | 1348 | N/A | N/A | 66.3 | 72.5 | 97.0 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A
Qwen 3.5 | Qwen | 397B | 262K | N/A | N/A | 87.8 | 88.4 | 92.6 | 1450 | 76.4 | N/A | 83.6 | N/A | N/A | 88.5 | 88.5 | 79.0 | 28.7 | 52.5 | N/A | 86.7 | 62.2 | 78.6
Step-3.5-Flash | Stepfun | 196B | 262K | $0.10 | $0.30 | 85.8 | N/A | N/A | 1389 | 74.4 | 81.1 | 86.4 | 99.8 | N/A | N/A | N/A | N/A | N/A | 51.0 | N/A | N/A | N/A | N/A
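
To make the pricing figures easier to compare, here is a minimal sketch of a per-request cost estimate. It rests on assumptions the page does not state: that the two dollar amounts listed for a model are the input and output prices, in that order, and that both are in USD per 1M tokens. The function name `request_cost` and the token counts are hypothetical, chosen only for illustration.

```python
# Prices copied from the listings above.
# ASSUMPTION: (input price, output price), each in USD per 1M tokens.
PRICES = {
    "Claude Opus 4.6": (15.00, 75.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Step-3.5-Flash": (0.10, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request under the per-1M-token assumption."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10,000 input tokens and 2,000 output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 10_000, 2_000):.4f}")
```

Because output tokens are priced higher than input tokens for every model listed with prices, the ratio of output to input tokens in a workload can matter as much as the headline price.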

Compare LLMs Head-to-Head

Benchmark | GPT-5.4 (Model A) | Claude Opus 4.6 (Model B)
GPQA Diamond | 92.8 | 91.3
Chatbot Arena | 1463 | 1503
MMMU-Pro | 81.2 | 77.3
Terminal-Bench 2.0 | 75.1 | 65.4
OSWorld | 75.0 | 72.7
BrowseComp | 82.7 | 84.0
Benchmarks won | 4 | 2
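
The "Benchmarks won" tally can be reproduced from the six scores in this comparison. The sketch below is one plausible way to compute it, not the site's actual method: it assumes that a higher score is better on every benchmark, that ties count for neither model, and that a benchmark is skipped when either score is missing. The function name `benchmarks_won` is hypothetical.

```python
from typing import Dict, Optional, Tuple

# Scores copied from the comparison: (GPT-5.4, Claude Opus 4.6).
SCORES: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "GPQA Diamond": (92.8, 91.3),
    "Chatbot Arena": (1463, 1503),
    "MMMU-Pro": (81.2, 77.3),
    "Terminal-Bench 2.0": (75.1, 65.4),
    "OSWorld": (75.0, 72.7),
    "BrowseComp": (82.7, 84.0),
}


def benchmarks_won(
    scores: Dict[str, Tuple[Optional[float], Optional[float]]]
) -> Tuple[int, int]:
    """Count wins per model, assuming higher is better on every benchmark."""
    a_wins = b_wins = 0
    for a, b in scores.values():
        if a is None or b is None:
            continue  # skip benchmarks without a score for both models
        if a > b:
            a_wins += 1
        elif b > a:
            b_wins += 1
        # equal scores count for neither model
    return a_wins, b_wins


if __name__ == "__main__":
    print(benchmarks_won(SCORES))  # (4, 2)
```

A simple win count treats a 1.5-point lead on GPQA Diamond the same as a 9.7-point lead on Terminal-Bench 2.0, so it is best read alongside the individual scores.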

Try These Models in Onyx

Onyx is the open-source AI platform that lets you connect any of these LLMs to your team's docs, apps, and people.