Measuring AI Ability to Complete Long Tasks

METR's empirical study shows that the length of tasks, measured in human wall-clock time, that AI agents can complete with a 50% success rate is doubling roughly every 7 months, an exponential capability trend.

Publisher: METR (Model Evaluation and Threat Research)
Year: 2025
Venue: preprint
ArXiv: arxiv.org/abs/2503.14499
URL: metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks
Code: github.com/METR/eval-analysis-public
Authors: 30
Hosting: External source, license unknown

Attribution

Abstract & full text: metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks
TL;DR: semanticscholar.org/paper/dee7d1a1f04d221e1b49d45ae5b78a544d4ef54a
Code: github.com/METR/eval-analysis-public

Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

This work proposes a new metric, the 50%-task-completion time horizon: the time humans typically take to complete tasks that AI models can complete with a 50% success rate. Extrapolating the observed trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
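
The extrapolation described above amounts to simple exponential arithmetic, sketched below. This is an illustrative calculation only, not code from the paper or its repository: the function names and the starting horizon are hypothetical, and the 7-month doubling time is the approximate figure quoted in this entry.

```python
import math

# Approximate doubling time quoted in this entry (assumption: treated as exact here).
DOUBLING_MONTHS = 7.0


def extrapolate_horizon(start_horizon: float, months_ahead: float,
                        doubling_months: float = DOUBLING_MONTHS) -> float:
    """Project a time horizon forward, assuming it doubles every `doubling_months`.

    `start_horizon` may be in any time unit; the result is in the same unit.
    """
    return start_horizon * 2.0 ** (months_ahead / doubling_months)


def months_to_reach(start_horizon: float, target_horizon: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for the horizon to grow from `start_horizon` to `target_horizon`."""
    return doubling_months * math.log2(target_horizon / start_horizon)


if __name__ == "__main__":
    # Hypothetical starting horizon of 1 hour, projected 5 years (60 months) ahead.
    growth = extrapolate_horizon(1.0, 60.0)
    print(f"Growth factor over 5 years: {growth:.0f}x")
```

Under these assumptions, five years of doubling every 7 months is about 8.6 doublings, or roughly a 380-fold increase in the time horizon. Whether the trend continues at that rate is an empirical question the sketch does not address.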

Artifacts

Evals

HCAST
