MathArena Leaderboard

A leaderboard from ETH Zurich's SRI Lab that evaluates LLMs on uncontaminated, freshly released math competitions (IMO, USAMO, Putnam, AIME, HMMT, IMC, etc.).

Kind: Aggregated
Updates: monthly · updated 3d ago
Notable for: The reference uncontaminated math leaderboard, and the only one where Gemini Deep Think and GPT-5 olympiad scores are taken seriously, because problems are evaluated within days of release.
Tracks: 1 eval · aggregated

Per-eval breakdown

172 models

| Model | Organization | Score |
| --- | --- | --- |
| o3 | OpenAI | 96.7% |
| GPT-5 | OpenAI | 94.6% |
| Grok 4 | xAI | 94.3% |
| o4 Mini | OpenAI | 94.0% |
| Qwen3 235B A22B Thinking 2507 | Alibaba | 94.0% |
| Grok 3 mini | xAI | 93.3% |
| Qwen3 30B A3B 2507 (Reasoning) | Alibaba | 90.7% |
| Gemini 2.5 Pro | Google (Alphabet Inc.) | 88.7% |
| GLM 4.5 | Zai | 87.3% |
| Gemini 2.5 Pro Preview (Mar '25) | Google (Alphabet Inc.) | 87.0% |
| o3 Mini | OpenAI | 86.0% |
| Qwen3-235B-A22B | Alibaba Qwen (Tongyi Qianwen) | 85.7% |
| MiniMax M1 80k | Minimax | 84.7% |
| Gemini 2.5 Flash Preview (Reasoning) | Google (Alphabet Inc.) | 84.3% |
| Gemini 2.5 Pro Preview (May '25) | Google (Alphabet Inc.) | 84.3% |
| Hermes 4 (405B) | Nous Research | 81.9% |
| MiniMax M1 40k | Minimax | 81.3% |
| R1 | DeepSeek | 79.8% |
| QwQ 32B | Alibaba | 78.0% |
| Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | NVIDIA | 74.7% |
| Qwen3 30B A3B Instruct 2507 | Alibaba | 72.7% |
| o1 | OpenAI | 72.3% |
| Qwen3 235B A22B Instruct 2507 | Alibaba | 71.7% |
| Magistral Small 1 | Mistral AI | 71.3% |
| Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) | NVIDIA | 70.7% |
| Magistral Medium 1 | Mistral AI | 70.0% |
| Kimi K2 0711 | Moonshot AI | 69.3% |
| Solar Pro 2 (Reasoning) | Upstage | 69.0% |
| R1 Distill Qwen 32B | DeepSeek | 68.7% |
| GLM 4.5 Air | Zai | 67.3% |
| R1 Distill Llama 70B | DeepSeek | 67.0% |
| DeepSeek R1 Distill Qwen 14B | DeepSeek | 66.7% |
| Solar Pro 2 (Preview) (Reasoning) | Upstage | 66.3% |
| DeepSeek R1 0528 Qwen3 8B | DeepSeek | 65.0% |
| o1 Mini | OpenAI | 60.3% |
| Claude 4 Opus | Anthropic | 56.3% |
| DeepSeek V3 0324 | DeepSeek | 52.0% |
| Reka Flash 3 | Reka AI | 51.0% |
| Gemini 2.0 Flash Thinking Experimental (Jan '25) | Google (Alphabet Inc.) | 50.0% |
| Gemini 2.5 Flash | Google (Alphabet Inc.) | 50.0% |
| Gemini 2.5 Flash Lite | Google (Alphabet Inc.) | 50.0% |
| ERNIE 4.5 300B A47B | Baidu | 49.3% |
| Sonar | Perplexity AI | 48.7% |
| Qwen3 Coder 480B A35B Instruct | Alibaba | 47.7% |
| EXAONE 4.0 32B | LG AI Research | 47.0% |
| QwQ 32B-Preview | Alibaba | 45.3% |
| Mistral Medium 3 | Mistral AI | 44.0% |
| GPT-4.1 | OpenAI | 43.7% |
| Gemini 2.5 Flash Preview (Non-reasoning) | Google (Alphabet Inc.) | 43.3% |
| GPT-4.1 Mini | OpenAI | 43.0% |
Evals tracked

1