Humanity's Last Exam (HLE)
Frontier
2,500 expert-authored questions across math, sciences, and humanities designed to be the hardest closed-ended benchmark for frontier models.
- Publisher: Center for AI Safety (CAIS)
- Capabilities: Scientific Reasoning, Math, Factual Recall
- Format: Custom
- Size: 2,500 tasks
- License: MIT
- Published: Jan 2025
- Updates: Monthly
- Notable for: The most cited "what's left to break" benchmark in 2026. It was designed as a successor to MMLU after LLMs saturated that benchmark; top scores remain under 50% even with frontier reasoning models.
- Canonical: lastexam.ai
- Official leaderboard: labs.scale.com/leaderboard/humanitys_last_exam
Top score: 45.7% by Claude Opus 4.8; 391 models reporting (80 from frontier labs)
- Score history: 390
- Top models: 391
- Where it's ranked: 1
- Related tools: 4 (implementations, trainers, datasets and scaffolds linked to this eval)
- Papers: 2
- Contributors: 2
FAQ
- What is Humanity's Last Exam (HLE)?
- 2,500 expert-authored questions across math, sciences, and humanities designed to be the hardest closed-ended benchmark for frontier models.
- What capabilities does Humanity's Last Exam (HLE) test?
- Humanity's Last Exam (HLE) evaluates scientific reasoning, math, and factual recall.
- What is the current top score on Humanity's Last Exam (HLE)?
- The top reported score is 45.7% by Claude Opus 4.8, across 391 models reporting (80 from frontier labs).
- How can a model improve its Humanity's Last Exam (HLE) score?
- Tools linked to Humanity's Last Exam (HLE) on Sophon include HLE RL Env (Prime Intellect), WEB PY RL Env (Community), WEB PY RL Env (Prime Community), and WEB PY RL Env (Prime Intellect): RL environments, datasets, and scaffolds that target this eval.
- What license is Humanity's Last Exam (HLE) under?
- Humanity's Last Exam (HLE) is available under the MIT license.
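
To put the reported percentage in terms of question counts, the following is a minimal sketch. It assumes the score is plain accuracy over all 2,500 questions, which this listing does not confirm; the official leaderboard may score a subset or apply its own grading rules. The function names are hypothetical and are not part of any HLE tooling.

```python
def accuracy_percent(num_correct: int, num_total: int) -> float:
    """Return the percentage of questions answered correctly."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return 100.0 * num_correct / num_total


def approx_correct_count(score_percent: float, num_total: int) -> int:
    """Estimate how many questions a reported percentage corresponds to."""
    return round(score_percent / 100.0 * num_total)


# Figures taken from the listing; treating the score as accuracy over
# the full set is an assumption, not something the listing states.
HLE_SIZE = 2500    # listed size: 2,500 tasks
TOP_SCORE = 45.7   # listed top score, in percent

correct = approx_correct_count(TOP_SCORE, HLE_SIZE)
print(f"approx. correct: {correct}, approx. missed: {HLE_SIZE - correct}")
```

Under that assumption, a score below 50% means the best reported model still misses more than half of the questions.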

