AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks or are not sufficiently difficult to meaningfully measure frontier models. To address this gap, we present Terminal-Bench 2.0: a carefully curated, hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, a human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark, and we conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness at https://www.tbench.ai/ to assist developers and researchers in future work.
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
- Year: 2026
- Venue: arXiv 2026
- Stars: 2.3k
- Authors: 85
Attribution
- Abstract & full text: arxiv.org/abs/2601.11868
- TL;DR: Semantic Scholar
Authors
Andy Konwinski, Etash Guha, John Yang, Ludwig Schmidt, Marianna Nezhurina, Negin Raoof, Niklas Muennighoff, Ryan Marten, Zilong Wang, Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, Anurag Kashyap, Jan-Lucas Uslu, Jeffrey Li, Jianbo Wu, Minghao Yan, Song Bian, Vedang Sharma, Ke Sun, Steven Dillmann, Akshay Anand, Andrew Lanpouthakoun, Bardia Koopah, Changran Hu, Gabriel H. S. Dreiman, Jiacheng Zhu, Karl Krauth, Li Zhong, Robert Amanfu, Shangyin Tan, Shreyas Pimpalgaonkar, Tushar Aggarwal, Xiangning Lin, Xin Lan, Xuandong Zhao, Yiqing Liang, Yuanli Wang, Changzhi Zhou, David Heineman, Hange Liu, Harsh Trivedi, Junhong Lin, Manish Shetty, Michael Yang, Nabil Omi, Shanda Li, Terry Yue Zhuo, Wuwei Lin, Yiwei Dai, Yuxin Wang, Wenhao Chai, Shang Zhou, Dariush Wahdany, Ziyu She, Jiaming Hu, Zhikang Dong, Yuxuan Zhu, Sasha Cui, Ahson Saiyed, Arinbjörn Kolbeinsson, Jesse Hu, Christopher Michael Rytting, Yixin Wang, Alex Dimakis