Measuring AI Ability to Complete Long Tasks
METR's empirical study showing that the length of tasks AI agents can complete with a 50% success rate, measured in the wall-clock time humans need for them, is doubling roughly every 7 months: an exponential capability trend.
- Year: 2025
- Venue: preprint
- Authors: 30
- Hosting: External source, license unknown
Introduces 1 artifact - 1 eval
TL;DR (Semantic Scholar)
This work proposes a new metric, the 50%-task-completion time horizon: the time humans typically take to complete tasks that AI models can complete with a 50% success rate. Extrapolating the observed trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
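The extrapolation is plain doubling arithmetic, and a minimal sketch may make it concrete. The 7-month doubling time comes from the summary above; the 1-hour starting horizon, the 167-hour work month, and both function names are assumptions chosen for illustration, not figures or code from the paper.

```python
import math


def extrapolate_horizon(current_hours: float, months_ahead: float,
                        doubling_months: float = 7.0) -> float:
    """Project a time horizon forward, assuming a constant doubling time."""
    return current_hours * 2.0 ** (months_ahead / doubling_months)


def months_until(current_hours: float, target_hours: float,
                 doubling_months: float = 7.0) -> float:
    """Months needed for the horizon to grow from current_hours to target_hours."""
    return doubling_months * math.log2(target_hours / current_hours)


if __name__ == "__main__":
    start_hours = 1.0         # ASSUMPTION: illustrative starting horizon
    work_month_hours = 167.0  # ASSUMPTION: roughly one month of full-time work

    in_five_years = extrapolate_horizon(start_hours, months_ahead=60.0)
    print(f"Projected horizon after 5 years: {in_five_years:.0f} hours")

    wait = months_until(start_hours, work_month_hours)
    print(f"Months until a one-work-month horizon: {wait:.1f}")
```

Under these assumptions, 60 months is about 8.6 doublings, a growth factor of roughly 380, so a 1-hour horizon would pass the one-work-month mark before the 5-year point. The projection is only as good as the assumption that the doubling time stays constant.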
Artifacts
Evals: 1
Authors
30 authors: Amy Deng, Ben West, Beth Barnes, David Rein, Elizabeth Barnes (Beth Barnes), Hjalmar Wijk, Joel Becker, Katherine Vu, Megan Kinniment, Michael Chen, Ozzie Gooen, Ryan Ho, Thomas Kwa, Katharyn Garcia, Max Hasin, Sami Jawhar, Nate Rush, Sydney von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, Lucas Jun Koba Sato, Daniel M. Ziegler, Lawrence Chan