LiveBench

Name: LiveBench
Creator: Abacus.AI

Continuously refreshed, contamination-resistant benchmark covering math, reasoning, coding, language, data analysis, and instruction-following with automatic objective scoring.

Operator: Abacus.AI
Kind: Aggregated
Updates: monthly · updated 8d ago
Notable for: LiveBench is co-led by Yann LeCun and the Abacus.AI team; its anti-contamination protocol has made it one of the most trusted continuously updated leaderboards.
URL: livebench.ai
Tracks: 7 evals · aggregated

Attribution

Scores: livebench.ai

Per-eval breakdown

142 models

Model	LiveBench	Language	Coding	Instruction Following	Reasoning	Math	Data Analysis	Aggregate
Gemini 3.1 Pro Preview Google (Alphabet Inc.)	82.4%	85.4%	76.5%	79.1%	84.0%	91.0%	78.5%	82.4%
Claude Fable 5 Anthropic	81.4%	88.5%	78.6%	59.9%	87.3%	93.9%	80.0%	81.4%
Claude Opus 4.8 Anthropic	80.1%	81.4%	79.3%	67.4%	89.7%	84.3%	78.3%	80.1%
GPT-5.4 OpenAI	79.8%	83.0%	78.2%	65.0%	85.7%	90.0%	77.0%	79.8%
Claude Opus 4.7 Anthropic	79.7%	77.9%	82.1%	59.3%	87.7%	93.1%	78.3%	79.7%
Gemini 3.5 Flash Google (Alphabet Inc.)	78.9%	84.6%	78.2%	75.6%	82.0%	88.2%	64.9%	78.9%
Claude Opus 4.6 Anthropic	78.8%	83.3%	78.2%	63.3%	88.7%	89.3%	69.9%	78.8%
GPT-5.2 OpenAI	78.7%	79.8%	76.1%	61.8%	83.2%	93.2%	78.2%	78.7%
Claude Sonnet 4.6 Anthropic	78.4%	77.7%	80.0%	63.9%	86.4%	86.5%	76.1%	78.4%
GPT-5.2-Codex OpenAI	78.1%	73.7%	83.6%	66.4%	77.7%	88.8%	78.2%	78.1%
Claude Opus 4.5 Anthropic	78.1%	81.3%	79.7%	62.5%	80.1%	90.4%	74.4%	78.1%
Qwen3.7 Max Alibaba	78.1%	79.7%	74.2%	74.0%	83.3%	85.2%	71.8%	78.1%
Gemini 2.5 Pro Google (Alphabet Inc.)	77.9%	65.9%	86.7%	81.0%	-	-	-	77.9%
Gemini 3 Flash Google (Alphabet Inc.)	77.8%	84.6%	73.9%	74.9%	74.5%	84.2%	74.8%	77.8%
Claude Sonnet 5 Anthropic	77.6%	78.8%	78.6%	59.4%	86.9%	89.6%	72.5%	77.6%
GPT-5.5 OpenAI	77.3%	85.6%	78.6%	65.7%	87.3%	69.8%	77.0%	77.3%
GLM 5.2 Zai	76.7%	76.2%	79.7%	62.3%	78.6%	89.8%	73.7%	76.7%
DeepSeek V4 Pro DeepSeek	76.4%	78.1%	70.0%	62.3%	82.7%	90.7%	74.5%	76.4%
GPT-5.3-Codex OpenAI	75.7%	80.1%	78.2%	65.4%	80.2%	87.8%	62.7%	75.7%
GPT-5.1 OpenAI	75.2%	79.3%	72.5%	63.9%	78.8%	86.9%	69.6%	75.2%
Claude Sonnet 3.7 Anthropic	74.3%	62.9%	74.2%	85.7%	-	-	-	74.3%
Qwen3.6 Plus Alibaba	73.5%	75.0%	78.2%	58.3%	75.8%	83.7%	69.9%	73.5%
GLM 5.1 Zai	72.7%	71.8%	75.4%	68.5%	72.5%	84.9%	63.2%	72.7%
Kimi K2.5 Moonshot AI	72.5%	77.7%	77.9%	57.4%	76.0%	84.9%	61.4%	72.5%
Kimi K2.7 Code Kimi	72.2%	77.9%	74.0%	56.3%	82.8%	79.6%	62.7%	72.2%
Grok 4.20 Beta 0309 Reasoning xAI	72.1%	77.7%	66.1%	63.4%	75.3%	87.1%	62.9%	72.1%
o1 OpenAI	71.7%	63.5%	68.8%	82.9%	-	-	-	71.7%
MiniMax M3 Minimax	71.7%	76.8%	68.2%	57.5%	74.5%	76.9%	76.2%	71.7%
o3 Mini OpenAI	71.6%	49.5%	82.8%	82.5%	-	-	-	71.6%
GLM 5 Zai	71.2%	77.5%	73.6%	55.3%	69.1%	83.5%	67.9%	71.2%
GPT-5.1-Codex OpenAI	71.2%	69.5%	71.8%	63.4%	82.0%	79.6%	60.7%	71.2%
o1 Preview OpenAI	71.0%	77.4%	52.3%	83.2%	-	-	-	71.0%
Claude Sonnet 4.5 Anthropic	70.7%	76.5%	80.4%	53.4%	77.6%	79.3%	57.0%	70.7%
DeepSeek V4 Flash DeepSeek	70.1%	70.1%	69.2%	63.1%	70.6%	79.6%	68.0%	70.1%
GPT-4.5 (Preview) OpenAI	69.8%	62.0%	75.0%	72.3%	-	-	-	69.8%
QwQ 32B Alibaba	69.6%	47.7%	75.8%	85.3%	-	-	-	69.6%
Grok 4.3 xAI	69.5%	73.6%	69.9%	62.7%	70.8%	84.3%	55.8%	69.5%
GPT-5 Mini OpenAI	69.1%	75.5%	68.2%	65.3%	68.3%	82.2%	55.2%	69.1%
DeepSeek V3 0324 DeepSeek	68.9%	48.7%	71.1%	86.8%	-	-	-	68.9%
Qwen3.6 27B Alibaba	68.1%	63.3%	71.8%	53.2%	70.3%	79.9%	70.4%	68.1%
Grok 4 xAI	67.4%	76.4%	73.1%	29.1%	79.1%	83.0%	63.4%	67.4%
R1 DeepSeek	66.8%	49.4%	70.3%	80.6%	-	-	-	66.8%
Gemini 3.1 Flash Lite Preview Google (Alphabet Inc.)	66.4%	73.2%	68.5%	68.6%	59.7%	73.6%	54.9%	66.4%
Qwen2.5Max Alibaba	66.4%	58.4%	64.1%	76.6%	-	-	-	66.4%
DeepSeek V3.2 DeepSeek	65.9%	70.4%	64.6%	48.2%	77.2%	85.0%	50.0%	65.9%
MiniMax M2.7 Minimax	65.7%	66.8%	54.9%	61.1%	74.8%	80.5%	56.3%	65.7%
Kimi K2 Thinking Kimi	65.5%	66.5%	67.4%	62.0%	63.5%	81.1%	52.3%	65.5%
Claude 4 Sonnet Anthropic	64.8%	72.9%	77.5%	44.3%	69.0%	70.5%	54.6%	64.8%
Grok 4.1 Fast xAI	64.7%	74.3%	69.6%	28.2%	80.2%	83.7%	52.2%	64.7%
ChatGPT 4o OpenAI	64.2%	51.4%	65.6%	75.6%	-	-	-	64.2%
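
In the rows above, the first and last values are the same overall score, and that score appears to equal the unweighted mean of whichever per-eval scores are present (a dash marks an eval with no result). This is an inference from the numbers shown, not a documented LiveBench formula. The Python sketch below checks it against two rows copied from the table; the helper names `parse_row` and `recomputed_overall` are hypothetical and exist only for this illustration.

```python
from statistics import mean
from typing import List, Optional, Tuple

# Two rows copied verbatim from the table above (tab-separated).
ROWS = [
    "Gemini 3.1 Pro Preview Google (Alphabet Inc.)\t82.4%\t85.4%\t76.5%\t79.1%\t84.0%\t91.0%\t78.5%\t82.4%",
    "Gemini 2.5 Pro Google (Alphabet Inc.)\t77.9%\t65.9%\t86.7%\t81.0%\t-\t-\t-\t77.9%",
]


def parse_row(line: str) -> Tuple[str, List[Optional[float]]]:
    """Split one row into (label, scores); a '-' cell becomes None."""
    label, *cells = line.rstrip("\n").split("\t")
    scores = [
        None if cell.strip() == "-" else float(cell.strip().rstrip("%"))
        for cell in cells
    ]
    return label, scores


def recomputed_overall(scores: List[Optional[float]]) -> float:
    """Mean of the available per-eval scores, rounded to one decimal.

    Assumption: scores[0] and scores[-1] are the overall figure and
    everything between them is a per-eval score.
    """
    per_eval = [s for s in scores[1:-1] if s is not None]
    return round(mean(per_eval), 1)


for row in ROWS:
    label, scores = parse_row(row)
    # Print the reported overall next to the recomputed mean for comparison.
    print(label, scores[0], recomputed_overall(scores))
```

If the assumption holds, models with only three reported evals are averaged over those three alone, so their overall scores are not directly comparable with models scored on all six.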

Evals tracked

LiveBench

LiveBench - Language

LiveBench - Coding

LiveBench - Instruction Following

LiveBench - Reasoning

LiveBench - Math

LiveBench - Data Analysis