What capabilities does Humanity's Last Exam (HLE) test?

Humanity's Last Exam (HLE) evaluates scientific reasoning, math, and factual recall.

What is the current top score on Humanity's Last Exam (HLE)?

The top reported score is 53.3% by Claude Fable 5, across 436 models reporting (105 from frontier labs).

How can a model improve its Humanity's Last Exam (HLE) score?

Tools linked to Humanity's Last Exam (HLE) on Sophon include HLE RL Env (Prime Intellect), WEB PY RL Env (Community), WEB PY RL Env (Prime Community), and WEB PY RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.

What license is Humanity's Last Exam (HLE) under?

Humanity's Last Exam (HLE) is available under the MIT license.

Humanity's Last Exam (HLE)

Frontier

2,500 expert-authored questions across math, sciences, and humanities designed to be the hardest closed-ended benchmark for frontier models.

Open

Publisher: Center for AI Safety (CAIS)
Capabilities: Scientific Reasoning, Math, Factual Recall
Format: Custom
Size: 2,500 tasks
License: MIT
Published: Jan 2025
Updates: Monthly
Notable for: The most cited "what's left to break" benchmark in 2026 — designed as a successor to MMLU when LLMs saturated it; top scores remain under 50% even with frontier reasoning models.
Canonical: lastexam.ai
Official leaderboard: labs.scale.com/leaderboard/humanitys_last_exam
Also on: huggingface.co/datasets/cais/hle
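
Because the benchmark is closed-ended, a score is a percentage of questions answered correctly. The official grading procedure is not described on this page, so the following is only a minimal sketch under the assumption of normalized exact-match grading; the `normalize` and `exact_match_accuracy` helpers and the sample answers are hypothetical and are not part of any HLE tooling.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(answer.strip().lower().split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the percentage of predictions that match their reference after normalization.

    Assumption: exact-match grading. A real harness may grade differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical toy data: three answers match after normalization, one does not.
predictions = ["Paris", " 42 ", "blue whale", "7"]
references = ["paris", "42", "Blue  Whale", "8"]
accuracy = exact_match_accuracy(predictions, references)
print(f"accuracy: {accuracy:.1f}%")
```

Exact match is the simplest possible grader and is chosen here only for illustration; it will undercount answers that are correct but phrased differently from the reference.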

Attribution

Leaderboard scores: VaultAA Anthropic prime-hub

Top score 53.3% by Claude Fable 5 - 436 models reporting (105 frontier)
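
To put a percentage score in context, it can help to convert it into an approximate question count and a rough sampling uncertainty. The sketch below does that for the 53.3% figure, assuming the score was computed over all 2,500 tasks and that questions can be treated as independent trials; neither assumption is confirmed by this page, and the `score_summary` helper is hypothetical.

```python
import math


def score_summary(score_pct: float, n_questions: int) -> dict:
    """Convert a percentage score into an approximate correct count and a
    binomial standard error expressed in percentage points.

    Assumption: each question is an independent pass/fail trial.
    """
    p = score_pct / 100.0
    approx_correct = p * n_questions
    std_err_pct = 100.0 * math.sqrt(p * (1.0 - p) / n_questions)
    return {"approx_correct": approx_correct, "std_err_pct": std_err_pct}


summary = score_summary(53.3, 2500)
print(f"approx. correct answers: {summary['approx_correct']:.0f}")
print(f"standard error: {summary['std_err_pct']:.2f} percentage points")
```

Under these assumptions the standard error comes out at roughly one percentage point, so differences between models that are much smaller than that are hard to distinguish from noise.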

Score history (433)

Top models (436)

[Bar chart: Humanity's Last Exam (HLE), 21 models. Highest value: Claude Mythos Preview at 64.7.]

Where it's ranked

Official leaderboard: labs.scale.com (single benchmark, updated monthly)

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

HLE RL Env (Prime Intellect)

Prime Intellect

Humanity's Last Exam evaluation environment

Implementation · RL Env · Reasoning · Tool Use · Multi Modal

WEB PY RL Env (Community)

Humanity's Last Examination (HLE) benchmark environment for prime-environments

Trains toward · RL Env · Reasoning · Tool Use · HLE

WEB PY RL Env (Prime Community)

Prime Community

Humanity's Last Examination (HLE) benchmark environment for Prime Community Environments

Trains toward · RL Env · Reasoning · Tool Use · HLE

WEB PY RL Env (Prime Intellect)

Prime Intellect

Humanity's Last Examination (HLE) benchmark environment for prime-environments

Trains toward · RL Env · Reasoning · Tool Use · HLE

Papers

Humanity's Last Exam

preprint · 2025

CAIS + Scale AI benchmark of ~3,000 expert-authored questions spanning every academic subject, designed to be the hardest closed-ended exam for frontier models.

introduces

Contributors

Dan Hendrycks, Scale AI
