Humanity's Last Exam (HLE)

Frontier

2,500 expert-authored questions across math, sciences, and humanities designed to be the hardest closed-ended benchmark for frontier models.

Format: Custom
Size: 2,500 tasks
License: MIT
Published: Jan 2025
Updates: Monthly
Notable for: The most cited "what's left to break" benchmark in 2026. It was designed as a successor to MMLU after LLMs saturated that benchmark; top scores remain under 50% even with frontier reasoning models.
Canonical: lastexam.ai

Attribution

Leaderboard scores: Vault, AA, prime-hub

Top score: 45.7% by Claude Opus 4.8; 391 models reporting (80 frontier)

Score history

390
[Chart: score history, 0% to 100%, Mar 23 to Nov 25. Labeled models: Claude Instant, Llama 2 Chat 7B, DBRX Instruct, o1, Gemini 2.5 Pro Preview (Mar' 25), Grok 4, Gemini 3 Pro, Gemini 3.1 Pro Preview, Claude Opus 4.8]

Top models

391
[Bar chart: Humanity's Last Exam (HLE), 21 models. Highest value: Claude Opus 4.8 at 45.7%.]

Where it's ranked

1

Related tools

4

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers

2

Contributors

2

FAQ

What is Humanity's Last Exam (HLE)?
2,500 expert-authored questions across math, sciences, and humanities designed to be the hardest closed-ended benchmark for frontier models.
What capabilities does Humanity's Last Exam (HLE) test?
Humanity's Last Exam (HLE) evaluates scientific reasoning, math, and factual recall.
What is the current top score on Humanity's Last Exam (HLE)?
The top reported score is 45.7% by Claude Opus 4.8, across 391 models reporting (80 from frontier labs).
How can a model improve its Humanity's Last Exam (HLE) score?
Tools linked to Humanity's Last Exam (HLE) on Sophon include HLE RL Env (Prime Intellect), WEB PY RL Env (Community), WEB PY RL Env (Prime Community), and WEB PY RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
What license is Humanity's Last Exam (HLE) under?
Humanity's Last Exam (HLE) is available under the MIT license.