MathArena Leaderboard
ETH SRI's leaderboard for evaluating LLMs on uncontaminated, freshly released math competitions (IMO, USAMO, Putnam, AIME, HMMT, IMC, etc.).
- Kind: Aggregated
- Updates: monthly (updated 3d ago)
- Notable for: The reference uncontaminated math leaderboard. Because problems are evaluated within days of release, it is the only one where Gemini Deep Think and GPT-5 olympiad scores are taken seriously.
- URL: matharena.ai
- Tracks: 1 eval (aggregated)
Per-eval breakdown
172 models
| Model | Aggregate score | Eval score |
|---|---|---|
| o3 OpenAI | 96.7% | 96.7% |
| GPT-5 OpenAI | 94.6% | 94.6% |
| Grok 4 xAI | 94.3% | 94.3% |
| o4 Mini OpenAI | 94.0% | 94.0% |
| Qwen3 235B A22B Thinking 2507 Alibaba | 94.0% | 94.0% |
| Grok 3 mini xAI | 93.3% | 93.3% |
| Qwen3 30B A3B 2507 (Reasoning) Alibaba | 90.7% | 90.7% |
| Gemini 2.5 Pro Google (Alphabet Inc.) | 88.7% | 88.7% |
| GLM 4.5 Zai | 87.3% | 87.3% |
| Gemini 2.5 Pro Preview (Mar '25) Google (Alphabet Inc.) | 87.0% | 87.0% |
| o3 Mini OpenAI | 86.0% | 86.0% |
| Qwen3-235B-A22B Alibaba Qwen (Tongyi Qianwen) | 85.7% | 85.7% |
| MiniMax M1 80k Minimax | 84.7% | 84.7% |
| Gemini 2.5 Flash Preview (Reasoning) Google (Alphabet Inc.) | 84.3% | 84.3% |
| Gemini 2.5 Pro Preview (May '25) Google (Alphabet Inc.) | 84.3% | 84.3% |
| Hermes 4 (405B) Nous Research | 81.9% | 81.9% |
| MiniMax M1 40k Minimax | 81.3% | 81.3% |
| R1 DeepSeek | 79.8% | 79.8% |
| QwQ 32B Alibaba | 78.0% | 78.0% |
| Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) NVIDIA | 74.7% | 74.7% |
| Qwen3 30B A3B Instruct 2507 Alibaba | 72.7% | 72.7% |
| o1 OpenAI | 72.3% | 72.3% |
| Qwen3 235B A22B Instruct 2507 Alibaba | 71.7% | 71.7% |
| Magistral Small 1 Mistral AI | 71.3% | 71.3% |
| Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) NVIDIA | 70.7% | 70.7% |
| Magistral Medium 1 Mistral AI | 70.0% | 70.0% |
| Kimi K2 0711 Moonshot AI | 69.3% | 69.3% |
| Solar Pro 2 (Reasoning) Upstage | 69.0% | 69.0% |
| R1 Distill Qwen 32B DeepSeek | 68.7% | 68.7% |
| GLM 4.5 Air Zai | 67.3% | 67.3% |
| R1 Distill Llama 70B DeepSeek | 67.0% | 67.0% |
| DeepSeek R1 Distill Qwen 14B DeepSeek | 66.7% | 66.7% |
| Solar Pro 2 (Preview) (Reasoning) Upstage | 66.3% | 66.3% |
| DeepSeek R1 0528 Qwen3 8B DeepSeek | 65.0% | 65.0% |
| o1 Mini OpenAI | 60.3% | 60.3% |
| Claude 4 Opus Anthropic | 56.3% | 56.3% |
| DeepSeek V3 0324 DeepSeek | 52.0% | 52.0% |
| Reka Flash 3 Reka AI | 51.0% | 51.0% |
| Gemini 2.0 Flash Thinking Experimental (Jan '25) Google (Alphabet Inc.) | 50.0% | 50.0% |
| Gemini 2.5 Flash Google (Alphabet Inc.) | 50.0% | 50.0% |
| Gemini 2.5 Flash Lite Google (Alphabet Inc.) | 50.0% | 50.0% |
| ERNIE 4.5 300B A47B Baidu | 49.3% | 49.3% |
| Sonar Perplexity AI | 48.7% | 48.7% |
| Qwen3 Coder 480B A35B Instruct Alibaba | 47.7% | 47.7% |
| EXAONE 4.0 32B LG AI Research | 47.0% | 47.0% |
| QwQ 32B-Preview Alibaba | 45.3% | 45.3% |
| Mistral Medium 3 Mistral AI | 44.0% | 44.0% |
| GPT-4.1 OpenAI | 43.7% | 43.7% |
| Gemini 2.5 Flash Preview (Non-reasoning) Google (Alphabet Inc.) | 43.3% | 43.3% |
| GPT-4.1 Mini OpenAI | 43.0% | 43.0% |