LiveBench
Continuously refreshed, contamination-resistant benchmark covering math, reasoning, coding, language, data analysis, and instruction-following with automatic objective scoring.
- Operator: Abacus.AI
- Kind: Aggregated
- Updates: monthly · updated 18h ago
- Notable for: LiveBench is co-led by Yann LeCun and the Abacus.AI team; its anti-contamination protocol made it one of the most trusted continuously updated leaderboards.
- URL: livebench.ai
- Tracks: 7 evals · aggregated
Per-eval breakdown
137 models
| Model | Aggregate | | | | | | | Aggregate |
|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro Preview Google (Alphabet Inc.) | 82.4% | 85.4% | 76.5% | 79.1% | 84.0% | 91.0% | 78.5% | 82.4% |
| Claude Opus 4.8 Anthropic | 80.1% | 81.4% | 79.3% | 67.4% | 89.7% | 84.3% | 78.3% | 80.1% |
| GPT-5.4 OpenAI | 79.8% | 83.0% | 78.2% | 65.0% | 85.7% | 90.0% | 77.0% | 79.8% |
| Claude Opus 4.7 Anthropic | 79.7% | 77.9% | 82.1% | 59.3% | 87.7% | 93.1% | 78.3% | 79.7% |
| Gemini 3.5 Flash Google (Alphabet Inc.) | 78.9% | 84.6% | 78.2% | 75.6% | 82.0% | 88.2% | 64.9% | 78.9% |
| Claude Opus 4.6 Anthropic | 78.8% | 83.3% | 78.2% | 63.3% | 88.7% | 89.3% | 69.9% | 78.8% |
| GPT-5.2 OpenAI | 78.7% | 79.8% | 76.1% | 61.8% | 83.2% | 93.2% | 78.2% | 78.7% |
| Claude Sonnet 4.6 Anthropic | 78.4% | 77.7% | 80.0% | 63.9% | 86.4% | 86.5% | 76.1% | 78.4% |
| GPT-5.2-Codex OpenAI | 78.1% | 73.7% | 83.6% | 66.4% | 77.7% | 88.8% | 78.2% | 78.1% |
| Claude Opus 4.5 Anthropic | 78.1% | 81.3% | 79.7% | 62.5% | 80.1% | 90.4% | 74.4% | 78.1% |
| Qwen3.7 Max Alibaba | 78.1% | 79.7% | 74.2% | 74.0% | 83.3% | 85.2% | 71.8% | 78.1% |
| Gemini 2.5 Pro Google (Alphabet Inc.) | 77.9% | 65.9% | 86.7% | 81.0% | - | - | - | 77.9% |
| Gemini 3 Flash Google (Alphabet Inc.) | 77.8% | 84.6% | 73.9% | 74.9% | 74.5% | 84.2% | 74.8% | 77.8% |
| GPT-5.5 OpenAI | 77.3% | 85.6% | 78.6% | 65.7% | 87.3% | 69.8% | 77.0% | 77.3% |
| DeepSeek V4 Pro DeepSeek | 76.4% | 78.1% | 70.0% | 62.3% | 82.7% | 90.7% | 74.5% | 76.4% |
| GPT-5.3-Codex OpenAI | 75.7% | 80.1% | 78.2% | 65.4% | 80.2% | 87.8% | 62.7% | 75.7% |
| GPT-5.1 OpenAI | 75.2% | 79.3% | 72.5% | 63.9% | 78.8% | 86.9% | 69.6% | 75.2% |
| Claude Sonnet 3.7 Anthropic | 74.3% | 62.9% | 74.2% | 85.7% | - | - | - | 74.3% |
| Qwen3.6 Plus Alibaba | 73.5% | 75.0% | 78.2% | 58.3% | 75.8% | 83.7% | 69.9% | 73.5% |
| GLM 5.1 Zai | 72.7% | 71.8% | 75.4% | 68.5% | 72.5% | 84.9% | 63.2% | 72.7% |
| Kimi K2.5 Moonshot AI | 72.5% | 77.7% | 77.9% | 57.4% | 76.0% | 84.9% | 61.4% | 72.5% |
| Grok 4.20 Beta 0309 Reasoning xAI | 72.1% | 77.7% | 66.1% | 63.4% | 75.3% | 87.1% | 62.9% | 72.1% |
| o1 OpenAI | 71.7% | 63.5% | 68.8% | 82.9% | - | - | - | 71.7% |
| o3 Mini OpenAI | 71.6% | 49.5% | 82.8% | 82.5% | - | - | - | 71.6% |
| GLM 5 Zai | 71.2% | 77.5% | 73.6% | 55.3% | 69.1% | 83.5% | 67.9% | 71.2% |
| GPT-5.1-Codex OpenAI | 71.2% | 69.5% | 71.8% | 63.4% | 82.0% | 79.6% | 60.7% | 71.2% |
| o1 Preview OpenAI | 71.0% | 77.4% | 52.3% | 83.2% | - | - | - | 71.0% |
| Claude Sonnet 4.5 Anthropic | 70.7% | 76.5% | 80.4% | 53.4% | 77.6% | 79.3% | 57.0% | 70.7% |
| DeepSeek V4 Flash DeepSeek | 70.1% | 70.1% | 69.2% | 63.1% | 70.6% | 79.6% | 68.0% | 70.1% |
| GPT-4.5 (Preview) OpenAI | 69.8% | 62.0% | 75.0% | 72.3% | - | - | - | 69.8% |
| QwQ 32B Alibaba | 69.6% | 47.7% | 75.8% | 85.3% | - | - | - | 69.6% |
| Grok 4.3 xAI | 69.5% | 73.6% | 69.9% | 62.7% | 70.8% | 84.3% | 55.8% | 69.5% |
| GPT-5 Mini OpenAI | 69.1% | 75.5% | 68.2% | 65.3% | 68.3% | 82.2% | 55.2% | 69.1% |
| DeepSeek V3 0324 DeepSeek | 68.9% | 48.7% | 71.1% | 86.8% | - | - | - | 68.9% |
| Qwen3.6 27B Alibaba | 68.1% | 63.3% | 71.8% | 53.2% | 70.3% | 79.9% | 70.4% | 68.1% |
| Grok 4 xAI | 67.4% | 76.4% | 73.1% | 29.1% | 79.1% | 83.0% | 63.4% | 67.4% |
| R1 DeepSeek | 66.8% | 49.4% | 70.3% | 80.6% | - | - | - | 66.8% |
| Gemini 3.1 Flash Lite Preview Google (Alphabet Inc.) | 66.4% | 73.2% | 68.5% | 68.6% | 59.7% | 73.6% | 54.9% | 66.4% |
| Qwen2.5Max Alibaba | 66.4% | 58.4% | 64.1% | 76.6% | - | - | - | 66.4% |
| DeepSeek V3.2 DeepSeek | 65.9% | 70.4% | 64.6% | 48.2% | 77.2% | 85.0% | 50.0% | 65.9% |
| MiniMax M2.7 Minimax | 65.7% | 66.8% | 54.9% | 61.1% | 74.8% | 80.5% | 56.3% | 65.7% |
| Kimi K2 Thinking Kimi | 65.5% | 66.5% | 67.4% | 62.0% | 63.5% | 81.1% | 52.3% | 65.5% |
| Claude 4 Sonnet Anthropic | 64.8% | 72.9% | 77.5% | 44.3% | 69.0% | 70.5% | 54.6% | 64.8% |
| Grok 4.1 Fast xAI | 64.7% | 74.3% | 69.6% | 28.2% | 80.2% | 83.7% | 52.2% | 64.7% |
| ChatGPT 4o OpenAI | 64.2% | 51.4% | 65.6% | 75.6% | - | - | - | 64.2% |
| Claude 4.1 Opus Anthropic | 64.1% | 72.8% | 74.7% | 42.4% | 72.3% | 73.2% | 49.0% | 64.1% |
| GPT-5.1-Codex-Mini OpenAI | 63.8% | 63.0% | 69.9% | 59.0% | 64.7% | 76.3% | 49.7% | 63.8% |
| Claude Sonnet 3.5 Anthropic | 63.7% | 56.4% | 65.6% | 69.2% | - | - | - | 63.7% |
| GPT-5.4 Mini OpenAI | 63.6% | 62.4% | 71.5% | 50.8% | 62.0% | 70.4% | 64.3% | 63.6% |
| DeepSeek V3.2 Exp DeepSeek | 63.4% | 71.1% | 70.1% | 41.3% | 64.4% | 82.4% | 51.5% | 63.4% |
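
In the rows above, the first and last score columns appear to equal the unweighted mean of whichever per-eval scores are present, with `-` marking an eval that was not run. This is an observation from the table, not a documented LiveBench formula, and the published per-eval values are already rounded, so a recomputed mean can differ in the last digit. A minimal Python sketch of that check, where the helper name `aggregate` and the two sample rows copied from the table are illustrative choices:

```python
from statistics import mean


def aggregate(scores):
    """Unweighted mean of the available per-eval scores, rounded to one decimal.

    Assumption: a cell containing "-" means the eval is missing and is
    skipped rather than counted as zero.
    """
    present = [float(s.rstrip("%")) for s in scores if s.strip() != "-"]
    if not present:
        raise ValueError("no per-eval scores available")
    return round(mean(present), 1)


# Per-eval cells copied from two rows of the table (aggregate columns omitted).
rows = {
    "Gemini 3.1 Pro Preview": ["85.4%", "76.5%", "79.1%", "84.0%", "91.0%", "78.5%"],
    "Gemini 2.5 Pro": ["65.9%", "86.7%", "81.0%", "-", "-", "-"],
}

for model, cells in rows.items():
    print(f"{model}: {aggregate(cells)}%")
```

Because missing evals are skipped, older models scored on only three evals are averaged over three values, so their aggregates are not directly comparable with models scored on all six columns.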