LiveBench

Continuously refreshed, contamination-resistant benchmark covering math, reasoning, coding, language, data analysis, and instruction-following with automatic objective scoring.

Operator: Abacus.AI
Kind: Aggregated
Updates: Monthly
Notable for: LiveBench is co-led by Yann LeCun and the Abacus.AI team, and its anti-contamination protocol has made it one of the most trusted continuously updated leaderboards.
Tracks: 7 evals · aggregated

Per-eval breakdown

137 models

| Model | Developer | Scores |
| --- | --- | --- |
| Gemini 3.1 Pro Preview | Google (Alphabet Inc.) | 82.4% 85.4% 76.5% 79.1% 84.0% 91.0% 78.5% 82.4% |
| Claude Opus 4.8 | Anthropic | 80.1% 81.4% 79.3% 67.4% 89.7% 84.3% 78.3% 80.1% |
| GPT-5.4 | OpenAI | 79.8% 83.0% 78.2% 65.0% 85.7% 90.0% 77.0% 79.8% |
| Claude Opus 4.7 | Anthropic | 79.7% 77.9% 82.1% 59.3% 87.7% 93.1% 78.3% 79.7% |
| Gemini 3.5 Flash | Google (Alphabet Inc.) | 78.9% 84.6% 78.2% 75.6% 82.0% 88.2% 64.9% 78.9% |
| Claude Opus 4.6 | Anthropic | 78.8% 83.3% 78.2% 63.3% 88.7% 89.3% 69.9% 78.8% |
| GPT-5.2 | OpenAI | 78.7% 79.8% 76.1% 61.8% 83.2% 93.2% 78.2% 78.7% |
| Claude Sonnet 4.6 | Anthropic | 78.4% 77.7% 80.0% 63.9% 86.4% 86.5% 76.1% 78.4% |
| GPT-5.2-Codex | OpenAI | 78.1% 73.7% 83.6% 66.4% 77.7% 88.8% 78.2% 78.1% |
| Claude Opus 4.5 | Anthropic | 78.1% 81.3% 79.7% 62.5% 80.1% 90.4% 74.4% 78.1% |
| Qwen3.7 Max | Alibaba | 78.1% 79.7% 74.2% 74.0% 83.3% 85.2% 71.8% 78.1% |
| Gemini 2.5 Pro | Google (Alphabet Inc.) | 77.9% 65.9% 86.7% 81.0% - - - 77.9% |
| Gemini 3 Flash | Google (Alphabet Inc.) | 77.8% 84.6% 73.9% 74.9% 74.5% 84.2% 74.8% 77.8% |
| GPT-5.5 | OpenAI | 77.3% 85.6% 78.6% 65.7% 87.3% 69.8% 77.0% 77.3% |
| DeepSeek V4 Pro | DeepSeek | 76.4% 78.1% 70.0% 62.3% 82.7% 90.7% 74.5% 76.4% |
| GPT-5.3-Codex | OpenAI | 75.7% 80.1% 78.2% 65.4% 80.2% 87.8% 62.7% 75.7% |
| GPT-5.1 | OpenAI | 75.2% 79.3% 72.5% 63.9% 78.8% 86.9% 69.6% 75.2% |
| Claude Sonnet 3.7 | Anthropic | 74.3% 62.9% 74.2% 85.7% - - - 74.3% |
| Qwen3.6 Plus | Alibaba | 73.5% 75.0% 78.2% 58.3% 75.8% 83.7% 69.9% 73.5% |
| GLM 5.1 | Zai | 72.7% 71.8% 75.4% 68.5% 72.5% 84.9% 63.2% 72.7% |
| Kimi K2.5 | Moonshot AI | 72.5% 77.7% 77.9% 57.4% 76.0% 84.9% 61.4% 72.5% |
| Grok 4.20 Beta 0309 Reasoning | xAI | 72.1% 77.7% 66.1% 63.4% 75.3% 87.1% 62.9% 72.1% |
| o1 | OpenAI | 71.7% 63.5% 68.8% 82.9% - - - 71.7% |
| o3 Mini | OpenAI | 71.6% 49.5% 82.8% 82.5% - - - 71.6% |
| GLM 5 | Zai | 71.2% 77.5% 73.6% 55.3% 69.1% 83.5% 67.9% 71.2% |
| GPT-5.1-Codex | OpenAI | 71.2% 69.5% 71.8% 63.4% 82.0% 79.6% 60.7% 71.2% |
| o1 Preview | OpenAI | 71.0% 77.4% 52.3% 83.2% - - - 71.0% |
| Claude Sonnet 4.5 | Anthropic | 70.7% 76.5% 80.4% 53.4% 77.6% 79.3% 57.0% 70.7% |
| DeepSeek V4 Flash | DeepSeek | 70.1% 70.1% 69.2% 63.1% 70.6% 79.6% 68.0% 70.1% |
| GPT-4.5 (Preview) | OpenAI | 69.8% 62.0% 75.0% 72.3% - - - 69.8% |
| QwQ 32B | Alibaba | 69.6% 47.7% 75.8% 85.3% - - - 69.6% |
| Grok 4.3 | xAI | 69.5% 73.6% 69.9% 62.7% 70.8% 84.3% 55.8% 69.5% |
| GPT-5 Mini | OpenAI | 69.1% 75.5% 68.2% 65.3% 68.3% 82.2% 55.2% 69.1% |
| DeepSeek V3 0324 | DeepSeek | 68.9% 48.7% 71.1% 86.8% - - - 68.9% |
| Qwen3.6 27B | Alibaba | 68.1% 63.3% 71.8% 53.2% 70.3% 79.9% 70.4% 68.1% |
| Grok 4 | xAI | 67.4% 76.4% 73.1% 29.1% 79.1% 83.0% 63.4% 67.4% |
| R1 | DeepSeek | 66.8% 49.4% 70.3% 80.6% - - - 66.8% |
| Gemini 3.1 Flash Lite Preview | Google (Alphabet Inc.) | 66.4% 73.2% 68.5% 68.6% 59.7% 73.6% 54.9% 66.4% |
| Qwen2.5Max | Alibaba | 66.4% 58.4% 64.1% 76.6% - - - 66.4% |
| DeepSeek V3.2 | DeepSeek | 65.9% 70.4% 64.6% 48.2% 77.2% 85.0% 50.0% 65.9% |
| MiniMax M2.7 | Minimax | 65.7% 66.8% 54.9% 61.1% 74.8% 80.5% 56.3% 65.7% |
| Kimi K2 Thinking | Kimi | 65.5% 66.5% 67.4% 62.0% 63.5% 81.1% 52.3% 65.5% |
| Claude 4 Sonnet | Anthropic | 64.8% 72.9% 77.5% 44.3% 69.0% 70.5% 54.6% 64.8% |
| Grok 4.1 Fast | xAI | 64.7% 74.3% 69.6% 28.2% 80.2% 83.7% 52.2% 64.7% |
| ChatGPT 4o | OpenAI | 64.2% 51.4% 65.6% 75.6% - - - 64.2% |
| Claude 4.1 Opus | Anthropic | 64.1% 72.8% 74.7% 42.4% 72.3% 73.2% 49.0% 64.1% |
| GPT-5.1-Codex-Mini | OpenAI | 63.8% 63.0% 69.9% 59.0% 64.7% 76.3% 49.7% 63.8% |
| Claude Sonnet 3.5 | Anthropic | 63.7% 56.4% 65.6% 69.2% - - - 63.7% |
| GPT-5.4 Mini | OpenAI | 63.6% 62.4% 71.5% 50.8% 62.0% 70.4% 64.3% 63.6% |
| DeepSeek V3.2 Exp | DeepSeek | 63.4% 71.1% 70.1% 41.3% 64.4% 82.4% 51.5% 63.4% |
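
The page does not say how the aggregate is computed. In the rows spot-checked here, though, the first (and last) percentage equals the plain mean of the per-eval scores that are present, with missing cells skipped. The sketch below reproduces that arithmetic for two rows. It is an assumption inferred from the numbers, not a documented scoring rule. The function name `aggregate_score` is hypothetical, and so is the reading of the middle six values as per-eval scores.

```python
from statistics import mean
from typing import Optional, Sequence


def aggregate_score(evals: Sequence[Optional[float]]) -> float:
    """Mean of the per-eval scores that are present, rounded to one decimal.

    Assumption: missing evals (shown as "-" on the page) are skipped rather
    than counted as zero. This is inferred from the listed numbers only.
    """
    present = [s for s in evals if s is not None]
    if not present:
        raise ValueError("at least one eval score is required")
    return round(mean(present), 1)


# Per-eval scores copied from two leaderboard rows; None marks a "-" cell.
gemini_31_pro = [85.4, 76.5, 79.1, 84.0, 91.0, 78.5]
gemini_25_pro = [65.9, 86.7, 81.0, None, None, None]

print(aggregate_score(gemini_31_pro))  # 82.4, matching the listed 82.4%
print(aggregate_score(gemini_25_pro))  # 77.9, matching the listed 77.9%
```

If this reading is right, models scored on only three evals are averaged over a different set of tasks than models scored on six, so their aggregates are not directly comparable.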
Evals tracked: 7