Measuring AI Ability to Complete Long Tasks

METR's empirical study showing that the length of tasks AI agents can complete with a 50% success rate, measured as the wall-clock time those tasks take humans, is doubling roughly every 7 months - an exponential capability trend.

Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

This work proposes a new metric: the 50%-task-completion time horizon, the time humans typically take to complete tasks that AI models can complete with a 50% success rate. Extrapolating the observed trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
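
The extrapolation behind this claim is plain exponential arithmetic, and a minimal sketch may make it concrete. It uses the roughly 7-month doubling time quoted in the summary above. The starting horizon of 60 minutes and the figure of 160 working hours for "a month" of human work are illustrative assumptions, not values taken from this page, and the function names are hypothetical.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling time quoted in the summary above


def extrapolate_horizon(h0_minutes, months_elapsed, doubling_months=DOUBLING_MONTHS):
    """Time horizon after `months_elapsed`, assuming steady exponential growth."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)


def months_to_reach(h0_minutes, target_minutes, doubling_months=DOUBLING_MONTHS):
    """Months needed for the horizon to grow from h0 to target under the same trend."""
    return doubling_months * math.log2(target_minutes / h0_minutes)


if __name__ == "__main__":
    start_horizon = 60.0          # ASSUMPTION: 1-hour horizon today (illustrative only)
    month_of_work = 160.0 * 60.0  # ASSUMPTION: one work-month = 160 hours, in minutes

    months = months_to_reach(start_horizon, month_of_work)
    print(f"Months until a one-work-month horizon: {months:.1f}")
    print(f"Horizon after 5 years: {extrapolate_horizon(start_horizon, 60.0) / 60.0:.0f} hours")
```

Under these assumed inputs the target is reached in a little over four years, which is consistent with the "within 5 years" statement. The result is sensitive to both the starting horizon and the doubling time, so it should be read as an illustration of the trend rather than a forecast.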

Artifacts

1

Evals

Authors

30