Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions, authored by more than 40 expert contributors and designed to evaluate AI's capacity for historical reasoning. The tasks span a wide range of historical problems, from factual retrieval based on primary sources, to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Because LLMs and other agents perform poorly on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in history. On HistBench, HistAgent built on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1 (14.49%), and Open Deep Research-smolagents (20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.
On Path to Multimodal Historical Reasoning: HistBench and HistAgent
HistBench, a benchmark with expert-authored historical questions, demonstrates HistAgent's superiority over general LLMs and agents for historical reasoning tasks.
- Year
- 2025
- Venue
- arXiv 2025
- Authors
- 98
- Hosting
- Abstract only
Attribution
- Abstract & full text
- arxiv.org/abs/2505.20246
- TL;DR
- Semantic Scholar
Authors
Wentao Zhang, Yujia Wu, Jiaqi Li, Shilong Liu, Han Xia, Yang Wang, Ling Yang, Mengdi Wang, Ying Zhao, Zhuoran Li, Zeyu Wang, Hongru Wang, Yuchen Yang, Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Xun Jiang, Kaixuan Huang, Xudong Liu, Yue Chen, Hao Xin, Tianyi Wang, Yao Shu, Siran Wang, Fulian Xiao, Yuchen Mao, Yijia Chen, Charles Argon, Jundi Cui, Daixin Chen, Junran Zhou, Shuyao Zhou, Zhanpeng Zhou, Yuming Cao, Yunfei Chen, Zhengyi Chen, Ruowei Dai, Mengqiu Deng, Jiye Fu, Yunting Gu, Zijie Guan, Zirui Huang, Xiaoyan Ji, Yumeng Jiang, Delong Kong, Haolong Li, Ruipeng Li, Tianze Li, Haixia Lian, Mengyue Lin, Jiayi Lu, Jinghan Lu, Wanyu Luo, Ziyue Luo, Zihao Pu, Zhi Qiao, Ruihuan Ren, Liang Wan, Ruixiang Wang, Tianhui Wang, Zihua Wang, Zhaoyi Wu, Weiao Xing, Ruojun Xiong, Weijie Xu, Xiao Yao, Xiaorui Yang, Nan Yi, Jiadong Yu, Yangyuxuan Yu, Huiting Zeng, Danni Zhang, Yunjie Zhang, Zhaoyu Zhang, Zhiheng Zhang, Xiaofeng Zheng, Peirong Zhou, Linyan Zhong, Xiaoyin Zong, Zhenxin Chen, Lin Ding, Xiaoyu Gao, Bingbing Gong, Yichao Li, Yang Liao, Guang Ma, Tianyuan Ma, Xinrui Sun, Ruobing Xian, Gen Ye, Tengfei Yu, Yuxi Wang, Xi Gao