Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7 to 10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30 to 40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: https://github.com/SKYLENAGE-AI/HLE-Verified
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
HLE-Verified presents a validated and revised version of the HLE benchmark with improved reliability through expert review and model-based checks, demonstrating significant accuracy improvements in language model evaluations.
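The abstract reports three disjoint subsets (verified, revised-and-certified, and uncertain). As a minimal sketch, the tally below sums those reported counts and computes each subset's share of the total. The dictionary keys and the helper name `subset_shares` are illustrative labels chosen here, not identifiers from the released dataset.

```python
# Subset sizes as reported in the abstract; the key names are
# illustrative labels, not field names from the released data.
SUBSET_SIZES = {
    "verified": 641,                # Stage I: passed binary validation
    "revised_and_certified": 1170,  # Stage II: repaired and adjudicated
    "uncertain": 689,               # released with uncertainty sources
}


def subset_shares(sizes):
    """Return the total item count and each subset's fraction of it."""
    total = sum(sizes.values())
    shares = {name: count / total for name, count in sizes.items()}
    return total, shares


total, shares = subset_shares(SUBSET_SIZES)
print(f"total items: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under these counts, the revised-and-certified subset is the largest of the three, and roughly a quarter of the items remain in the uncertain set.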
- Year: 2026
- Venue: arXiv 2026
- Authors: 35
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2602.13964
- TL;DR: Semantic Scholar
Authors
Junyang Lin, An Yang, Bowen Yu, Dayiheng Liu, Hu Wei, Peng Wang, Anfeng Li, Ben Wang, Yuhao Zhou, Zeyu Wang, Xingzhe Wu, Bing Zhao, Que Shen, Jialong Chen, Xiang Xu, Yiyuan Li, Yaxuan Wang, Yijie Hu, Jianan Ye, Bohan Wang, Zhihai Wang, Boyu Yang, Weiqi Zhai, Jinghang Wang, Xiaogang Li, Qiyuan Feng, Shoulin Han, Wenjie Luo, Ruixian Luo, Guojie Lin, Peiyao Xiao, Chengliang Xu, Zichao Chen, Zongwen Shen, Yuliang Xu