
Open LLM Leaderboard

Hugging Face's automated leaderboard, which runs a fixed evaluation harness across thousands of open-weight LLMs and reports per-task and aggregate scores.

Operator: Hugging Face
Kind: Aggregated
Updates: live (updated 9h ago)
Notable for: The dominant public ranking of open-weight LLMs; it needs no API access to run and surfaces small and specialty models that closed-API leaderboards ignore.
Tracks: 7 evals (aggregated)


Per-eval breakdown

347 models

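Columns Eval 1 to Eval 7 are the seven tracked evals in page order (the eval names are not shown in this view); "-" means no score reported.

| Model | Operator | Eval 1 | Eval 2 | Eval 3 | Eval 4 | Eval 5 | Eval 6 | Eval 7 | Aggregate |
|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro Preview | Google (Alphabet Inc.) | - | 93.2% | - | - | - | - | - | 93.2% |
| Gemini 3 Pro | Google (Alphabet Inc.) | 89.8% | - | - | - | - | - | - | 89.8% |
| Gemini 3 Pro Preview | Google (Alphabet Inc.) | 89.5% | - | - | - | - | - | - | 89.5% |
| Gemini 3 Flash Preview | Google (Alphabet Inc.) | 89.0% | 88.4% | - | - | - | - | - | 88.7% |
| GPT-5.5 | OpenAI | 88.6% | - | - | - | - | - | - | 88.6% |
| GPT-5 | OpenAI | 82.0% | 83.3% | 100.0% | - | - | - | - | 88.4% |
| Gemini 3 Flash | Google (Alphabet Inc.) | 88.2% | - | - | - | - | - | - | 88.2% |
| Claude 4.1 Opus | Anthropic | 88.0% | - | - | - | - | - | - | 88.0% |
| MiniMax M2.1 | Minimax | 87.5% | - | - | - | - | - | - | 87.5% |
| Qwen3.5 397B A17B | Alibaba | 87.3% | - | - | - | - | - | - | 87.3% |
| GPT-5.4 | OpenAI | - | 87.2% | - | - | - | - | - | 87.2% |
| GPT-4.1 Mini | OpenAI | 78.1% | - | 100.0% | 83.5% | - | - | - | 87.2% |
| Claude Opus 4.5 | Anthropic | 88.9% | 84.7% | - | - | - | - | - | 86.8% |
| Grok 4 | xAI | 86.6% | - | - | - | - | - | - | 86.6% |
| GPT-5 Codex | OpenAI | 86.5% | - | - | - | - | - | - | 86.5% |
| DeepSeek V3.2 Speciale | DeepSeek | 86.3% | - | - | - | - | - | - | 86.3% |
| Gemini 2.5 Pro | Google (Alphabet Inc.) | 86.2% | - | - | - | - | - | - | 86.2% |
| Claude 4 Opus | Anthropic | 86.0% | - | - | - | - | - | - | 86.0% |
| GPT-5.1-Codex | OpenAI | 86.0% | - | - | - | - | - | - | 86.0% |
| Gemini 2.5 Pro Preview (Mar '25) | Google (Alphabet Inc.) | 85.8% | - | - | - | - | - | - | 85.8% |
| Doubao Seed Code | ByteDance Seed | 85.4% | - | - | - | - | - | - | 85.4% |
| GLM 5.1 | Zai | 85.4% | - | - | - | - | - | - | 85.4% |
| o3 | OpenAI | 85.3% | - | - | - | - | - | - | 85.3% |
| DeepSeek V3.1 | DeepSeek | 85.1% | - | - | - | - | - | - | 85.1% |
| MiMo-V2.5-Pro | Xiaomi | 85.1% | - | - | - | - | - | - | 85.1% |
| Claude Sonnet 4.5 | Anthropic | 86.0% | 83.9% | - | - | - | - | - | 85.0% |
| Kimi K2 Thinking | Kimi | 84.8% | - | - | - | - | - | - | 84.8% |
| MiniMax M2.5 | Minimax | - | 84.5% | - | - | - | - | - | 84.5% |
| R1 | DeepSeek | 84.4% | - | - | - | - | - | - | 84.4% |
| Qwen3 235B A22B Thinking 2507 | Alibaba | 84.3% | - | - | - | - | - | - | 84.3% |
| Gemini 2.5 Flash Preview (Sep '25) (Reasoning) | Google (Alphabet Inc.) | 84.2% | - | - | - | - | - | - | 84.2% |
| o1 | OpenAI | 84.1% | - | - | - | - | - | - | 84.1% |
| Qwen3 Max | Alibaba | 84.1% | - | - | - | - | - | - | 84.1% |
| Qwen3 Max (Preview) | Alibaba | 83.8% | - | - | - | - | - | - | 83.8% |
| DeepSeek V3.2 | DeepSeek | 83.7% | - | - | - | - | - | - | 83.7% |
| Gemini 2.5 Pro Preview (May '25) | Google (Alphabet Inc.) | 83.7% | - | - | - | - | - | - | 83.7% |
| DeepSeek V3.1 Terminus | DeepSeek | 83.6% | - | - | - | - | - | - | 83.6% |
| DeepSeek V3.2 Exp | DeepSeek | 83.6% | - | - | - | - | - | - | 83.6% |
| Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) | Google (Alphabet Inc.) | 83.6% | - | - | - | - | - | - | 83.6% |
| Qwen3 VL 235B A22B Thinking | Alibaba | 83.6% | - | - | - | - | - | - | 83.6% |
| GLM 4.5 | Zai | 83.5% | - | - | - | - | - | - | 83.5% |
| o4 Mini | OpenAI | 83.2% | - | - | - | - | - | - | 83.2% |
| ERNIE 5.0 Thinking Preview | Baidu | 83.0% | - | - | - | - | - | - | 83.0% |
| Grok 3 mini | xAI | 82.8% | - | - | - | - | - | - | 82.8% |
| Qwen3 235B A22B Instruct 2507 | Alibaba | 82.8% | - | - | - | - | - | - | 82.8% |
| Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | NVIDIA | 82.5% | - | - | - | - | - | - | 82.5% |
| Kimi K2 0711 | Moonshot AI | 82.4% | - | - | - | - | - | - | 82.4% |
| Qwen3 Max Thinking (Preview) | Alibaba | 82.4% | - | - | - | - | - | - | 82.4% |
| Qwen3 Next 80B A3B Thinking | Alibaba | 82.4% | - | - | - | - | - | - | 82.4% |
| Qwen3 VL 235B A22B Instruct | Alibaba | 82.3% | - | - | - | - | - | - | 82.3% |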
347 / 347 models
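
The final figure in each row above looks like an unweighted mean of whichever per-eval scores that model has, rounded to one decimal. GPT-5, for instance, lists 82.0%, 83.3% and 100.0% with an aggregate of 88.4%. The page does not document the formula, so the sketch below is an inference from the visible rows only. The function name `aggregate` and the use of `None` for a missing score are illustrative assumptions, not part of the leaderboard.

```python
from statistics import mean
from typing import Optional, Sequence


def aggregate(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Unweighted mean of the scores that are present, rounded to 1 decimal.

    ASSUMPTION: this mirrors what the visible rows suggest; the leaderboard's
    actual aggregation rule is not stated on the page.
    """
    # Missing evals (shown as "-" on the page) are skipped, not counted as 0.
    present = [s for s in scores if s is not None]
    if not present:
        return None  # no evals reported, so no aggregate
    return round(mean(present), 1)


# GPT-5 row: three of the seven evals reported.
gpt5 = [82.0, 83.3, 100.0, None, None, None, None]
print(aggregate(gpt5))
```

If this reading is right, aggregates for models with different eval coverage are not directly comparable: a model scored on a single eval is ranked against models averaged over two or three.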

Evals tracked: 7