Open LLM Leaderboard

Name: Open LLM Leaderboard
Creator: Hugging Face

Hugging Face's automated leaderboard running a fixed evaluation harness across thousands of open-weight LLMs, reporting per-task and aggregate scores.

Operator: Hugging Face
Kind: Aggregated
Updates: live · updated 16d ago
Notable for: The dominant public ranking of open-weight LLMs; evaluation requires no API access, and the leaderboard surfaces small and specialty models that closed-API leaderboards ignore.
URL: huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
Tracks: 7 evals · aggregated

Attribution

Scores: huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

Per-eval breakdown

347 models

Model (creator)	Eval 1	Eval 2	Eval 3	Eval 4	Eval 5	Eval 6	Eval 7	Average
Gemini 3.1 Pro Preview Google (Alphabet Inc.)	-	93.2%	-	-	-	-	-	93.2%
Gemini 3 Pro Google (Alphabet Inc.)	89.8%	-	-	-	-	-	-	89.8%
Gemini 3 Pro Preview Google (Alphabet Inc.)	89.5%	-	-	-	-	-	-	89.5%
Gemini 3 Flash Preview Google (Alphabet Inc.)	89.0%	88.4%	-	-	-	-	-	88.7%
GPT-5.5 OpenAI	88.6%	-	-	-	-	-	-	88.6%
GPT-5 OpenAI	82.0%	83.3%	100.0%	-	-	-	-	88.4%
Gemini 3 Flash Google (Alphabet Inc.)	88.2%	-	-	-	-	-	-	88.2%
Claude 4.1 Opus Anthropic	88.0%	-	-	-	-	-	-	88.0%
MiniMax M2.1 Minimax	87.5%	-	-	-	-	-	-	87.5%
Qwen3.5 397B A17B Alibaba	87.3%	-	-	-	-	-	-	87.3%
GPT-5.4 OpenAI	-	87.2%	-	-	-	-	-	87.2%
GPT-4.1 Mini OpenAI	78.1%	-	100.0%	83.5%	-	-	-	87.2%
Claude Opus 4.5 Anthropic	88.9%	84.7%	-	-	-	-	-	86.8%
Grok 4 xAI	86.6%	-	-	-	-	-	-	86.6%
GPT-5 Codex OpenAI	86.5%	-	-	-	-	-	-	86.5%
DeepSeek V3.2 Speciale DeepSeek	86.3%	-	-	-	-	-	-	86.3%
Gemini 2.5 Pro Google (Alphabet Inc.)	86.2%	-	-	-	-	-	-	86.2%
Claude 4 Opus Anthropic	86.0%	-	-	-	-	-	-	86.0%
GPT-5.1-Codex OpenAI	86.0%	-	-	-	-	-	-	86.0%
Gemini 2.5 Pro Preview (Mar' 25) Google (Alphabet Inc.)	85.8%	-	-	-	-	-	-	85.8%
Doubao Seed Code ByteDance Seed	85.4%	-	-	-	-	-	-	85.4%
GLM 5.1 Zai	85.4%	-	-	-	-	-	-	85.4%
o3 OpenAI	85.3%	-	-	-	-	-	-	85.3%
DeepSeek V3.1 DeepSeek	85.1%	-	-	-	-	-	-	85.1%
MiMo-V2.5-Pro Xiaomi	85.1%	-	-	-	-	-	-	85.1%
Claude Sonnet 4.5 Anthropic	86.0%	83.9%	-	-	-	-	-	85.0%
Kimi K2 Thinking Kimi	84.8%	-	-	-	-	-	-	84.8%
MiniMax M2.5 Minimax	-	84.5%	-	-	-	-	-	84.5%
R1 DeepSeek	84.4%	-	-	-	-	-	-	84.4%
Qwen3 235B A22B Thinking 2507 Alibaba	84.3%	-	-	-	-	-	-	84.3%
Gemini 2.5 Flash Preview (Sep '25) (Reasoning) Google (Alphabet Inc.)	84.2%	-	-	-	-	-	-	84.2%
o1 OpenAI	84.1%	-	-	-	-	-	-	84.1%
Qwen3 Max Alibaba	84.1%	-	-	-	-	-	-	84.1%
Qwen3 Max (Preview) Alibaba	83.8%	-	-	-	-	-	-	83.8%
DeepSeek V3.2 DeepSeek	83.7%	-	-	-	-	-	-	83.7%
Gemini 2.5 Pro Preview (May' 25) Google (Alphabet Inc.)	83.7%	-	-	-	-	-	-	83.7%
DeepSeek V3.1 Terminus DeepSeek	83.6%	-	-	-	-	-	-	83.6%
DeepSeek V3.2 Exp DeepSeek	83.6%	-	-	-	-	-	-	83.6%
Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) Google (Alphabet Inc.)	83.6%	-	-	-	-	-	-	83.6%
Qwen3 VL 235B A22B Thinking Alibaba	83.6%	-	-	-	-	-	-	83.6%
GLM 4.5 Zai	83.5%	-	-	-	-	-	-	83.5%
o4 Mini OpenAI	83.2%	-	-	-	-	-	-	83.2%
ERNIE 5.0 Thinking Preview Baidu	83.0%	-	-	-	-	-	-	83.0%
Grok 3 mini xAI	82.8%	-	-	-	-	-	-	82.8%
Qwen3 235B A22B Instruct 2507 Alibaba	82.8%	-	-	-	-	-	-	82.8%
Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) NVIDIA	82.5%	-	-	-	-	-	-	82.5%
Kimi K2 0711 Moonshot AI	82.4%	-	-	-	-	-	-	82.4%
Qwen3 Max Thinking (Preview) Alibaba	82.4%	-	-	-	-	-	-	82.4%
Qwen3 Next 80B A3B Thinking Alibaba	82.4%	-	-	-	-	-	-	82.4%
Qwen3 VL 235B A22B Instruct Alibaba	82.3%	-	-	-	-	-	-	82.3%
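
The final value in each row appears to be the unweighted mean of whichever per-eval scores are present, with "-" entries skipped. For example, GPT-5's 82.0%, 83.3%, and 100.0% average to 88.4%. A minimal sketch of that calculation follows, assuming tab-separated rows as in the table above. The function name `aggregate_score` is hypothetical, and the one-decimal rounding is an assumption inferred from the displayed values, so results at exact halves may differ from the page.

```python
from typing import List, Optional


def aggregate_score(cells: List[str]) -> Optional[float]:
    """Mean of the per-eval cells that hold a score; "-" marks a missing eval."""
    scores = [float(c.strip().rstrip("%")) for c in cells if c.strip() != "-"]
    if not scores:
        return None  # no eval reported for this model
    # Assumption: the page rounds the mean to one decimal place.
    return round(sum(scores) / len(scores), 1)


# One row copied from the table: name, seven eval cells, reported aggregate.
row = "GPT-5 OpenAI\t82.0%\t83.3%\t100.0%\t-\t-\t-\t-\t88.4%"
name, *evals, reported = row.split("\t")
print(name, aggregate_score(evals), reported)
```

Because the mean is taken only over reported evals, a model with a single score is ranked on that one number, so rows with different eval coverage are not directly comparable.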

Evals tracked