Current benchmarks for evaluating the reasoning capabilities of Large Language Models (LLMs) face significant limitations: task oversimplification, data contamination, and flawed evaluation items. To address these limitations, we introduce PHYBench, a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and differentiates reasoning models more strongly than baselines such as AIME 2024, OlympiadBench, and GPQA. Even the best-performing model, Gemini 2.5 Pro, achieves only 36.9% accuracy, compared to human experts' 61.9%. To further enhance evaluation precision, we introduce the Expression Edit Distance (EED) Score for mathematical expression assessment, which improves sample efficiency by 204% over binary scoring. Moreover, PHYBench effectively elicits multi-step and multi-condition reasoning, providing a platform for examining models' reasoning robustness, preferences, and deficiencies. The benchmark results and dataset are publicly available at https://www.phybench.cn/.
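To make the idea of edit-distance scoring for mathematical expressions concrete, the following is a minimal sketch and not the paper's EED implementation. It assumes expressions are written in Python syntax, parses them with the standard `ast` module, and computes a unit-cost ordered tree edit distance between the two expression trees. The function names (`to_tree`, `forest_distance`, `eed_like_score`) and the linear mapping from distance to a 0-100 score are illustrative assumptions; the abstract does not define the actual scoring function, and no algebraic simplification is attempted here.

```python
import ast
from functools import lru_cache


def to_tree(node):
    """Convert a Python AST expression node into a (label, children) tuple."""
    if isinstance(node, ast.Expression):
        return to_tree(node.body)
    if isinstance(node, ast.BinOp):
        return (type(node.op).__name__, (to_tree(node.left), to_tree(node.right)))
    if isinstance(node, ast.UnaryOp):
        return (type(node.op).__name__, (to_tree(node.operand),))
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        return (node.func.id, tuple(to_tree(arg) for arg in node.args))
    if isinstance(node, ast.Name):
        return (node.id, ())
    if isinstance(node, ast.Constant):
        return (repr(node.value), ())
    raise ValueError(f"unsupported expression node: {type(node).__name__}")


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        return sum(size(t) for t in g)
    if not g:
        return sum(size(t) for t in f)
    (label_f, kids_f), (label_g, kids_g) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_distance(f[:-1] + kids_f, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + kids_g) + 1
    # Match the two rightmost roots, relabelling if the labels differ.
    match = (forest_distance(kids_f, kids_g)
             + forest_distance(f[:-1], g[:-1])
             + (label_f != label_g))
    return min(delete, insert, match)


def eed_like_score(reference, answer):
    """Illustrative partial-credit score in [0, 100] (assumed linear mapping)."""
    ref_tree = to_tree(ast.parse(reference, mode="eval"))
    ans_tree = to_tree(ast.parse(answer, mode="eval"))
    distance = forest_distance((ref_tree,), (ans_tree,))
    n = size(ref_tree)
    return max(0.0, 100.0 * (n - distance) / n)


if __name__ == "__main__":
    print(eed_like_score("a*b + c", "a*b + c"))  # identical expressions
    print(eed_like_score("a*b + c", "a*b - c"))  # one wrong operator
```

Unlike binary scoring, an answer that differs from the reference by a single operator still receives partial credit in this sketch, which is the property that lets each problem carry more information about a model.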
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
- Year: 2025
- Venue: arXiv 2025
- Authors: 54
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2504.16074v2
- TL;DR: Semantic Scholar
Authors
Jiaming Ji, Yaodong Yang, Bohan Zhang, Minghao Li, Shi Qiu, Qi Liu, Zheyu Shen, Muhan Zhang, Jiawei Lin, Zeyu Cai, Tianyu Zhang, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Ziheng Zhou, Jingtian Zhang, Zhangyi Liu, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, Jianxiang Li, Feiyu Tao, Qihua Sun, Zhou Liang, Yushu Mu, Zhongxuan Li, Jing-Jun Zhang, Shutao Zhang, Xiaotian Li, Xingqi Xia, Jiahang Chen, Qiuhao Xiong, Binran Wang, Fengyuan Wang, Ziyang Ni, Fan Cui, Changkun Shao, Qing-Hong Cao, Ming-Xing Luo, Hua Xing Zhu