MathArena Leaderboard

Creator: ETH Zürich SRI Lab (Secure, Reliable & Intelligent Systems)

ETH SRI's leaderboard for evaluating LLMs on uncontaminated, freshly released math competitions (IMO, USAMO, Putnam, AIME, HMMT, IMC, etc.).

Operator: ETH Zürich SRI Lab (Secure, Reliable & Intelligent Systems)
Kind: Aggregated
Updates: monthly · updated 17d ago
Notable for: The reference uncontaminated math leaderboard, and the only one where Gemini Deep Think and GPT-5 olympiad scores are taken seriously, because problems are evaluated within days of release.
URL: matharena.ai
Tracks: 1 eval · aggregated

Attribution

Scores: matharena.ai

Per-eval breakdown

172 models

Model	Overall	AIME 2024
o3 OpenAI	96.7%	96.7%
GPT-5 OpenAI	94.6%	94.6%
Grok 4 xAI	94.3%	94.3%
o4 Mini OpenAI	94.0%	94.0%
Qwen3 235B A22B Thinking 2507 Alibaba	94.0%	94.0%
Grok 3 mini xAI	93.3%	93.3%
Qwen3 30B A3B 2507 (Reasoning) Alibaba	90.7%	90.7%
Gemini 2.5 Pro Google (Alphabet Inc.)	88.7%	88.7%
GLM 4.5 Zai	87.3%	87.3%
Gemini 2.5 Pro Preview (Mar '25) Google (Alphabet Inc.)	87.0%	87.0%
o3 Mini OpenAI	86.0%	86.0%
Qwen3-235B-A22B Alibaba Qwen (Tongyi Qianwen)	85.7%	85.7%
MiniMax M1 80k Minimax	84.7%	84.7%
Gemini 2.5 Flash Preview (Reasoning) Google (Alphabet Inc.)	84.3%	84.3%
Gemini 2.5 Pro Preview (May '25) Google (Alphabet Inc.)	84.3%	84.3%
Hermes 4 (405B) Nous Research	81.9%	81.9%
MiniMax M1 40k Minimax	81.3%	81.3%
R1 DeepSeek	79.8%	79.8%
QwQ 32B Alibaba	78.0%	78.0%
Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) NVIDIA	74.7%	74.7%
Qwen3 30B A3B Instruct 2507 Alibaba	72.7%	72.7%
o1 OpenAI	72.3%	72.3%
Qwen3 235B A22B Instruct 2507 Alibaba	71.7%	71.7%
Magistral Small 1 Mistral AI	71.3%	71.3%
Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) NVIDIA	70.7%	70.7%
Magistral Medium 1 Mistral AI	70.0%	70.0%
Kimi K2 0711 Moonshot AI	69.3%	69.3%
Solar Pro 2 (Reasoning) Upstage	69.0%	69.0%
R1 Distill Qwen 32B DeepSeek	68.7%	68.7%
GLM 4.5 Air Zai	67.3%	67.3%
R1 Distill Llama 70B DeepSeek	67.0%	67.0%
DeepSeek R1 Distill Qwen 14B DeepSeek	66.7%	66.7%
Solar Pro 2 (Preview) (Reasoning) Upstage	66.3%	66.3%
DeepSeek R1 0528 Qwen3 8B DeepSeek	65.0%	65.0%
o1 Mini OpenAI	60.3%	60.3%
Claude 4 Opus Anthropic	56.3%	56.3%
DeepSeek V3 0324 DeepSeek	52.0%	52.0%
Reka Flash 3 Reka AI	51.0%	51.0%
Gemini 2.0 Flash Thinking Experimental (Jan '25) Google (Alphabet Inc.)	50.0%	50.0%
Gemini 2.5 Flash Google (Alphabet Inc.)	50.0%	50.0%
Gemini 2.5 Flash Lite Google (Alphabet Inc.)	50.0%	50.0%
ERNIE 4.5 300B A47B Baidu	49.3%	49.3%
Sonar Perplexity AI	48.7%	48.7%
Qwen3 Coder 480B A35B Instruct Alibaba	47.7%	47.7%
EXAONE 4.0 32B LG AI Research	47.0%	47.0%
QwQ 32B-Preview Alibaba	45.3%	45.3%
Mistral Medium 3 Mistral AI	44.0%	44.0%
GPT-4.1 OpenAI	43.7%	43.7%
Gemini 2.5 Flash Preview (Non-reasoning) Google (Alphabet Inc.)	43.3%	43.3%
GPT-4.1 Mini OpenAI	43.0%	43.0%

Evals tracked

AIME 2024: Problems from the American Invitational Mathematics Examination