Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
NL2Repo Bench evaluates long-horizon software development capabilities of coding agents by assessing their ability to generate complete Python libraries from natural-language requirements.
- Year
- 2025
- Venue
- arXiv 2025
- Authors
- 48
- Hosting
- Abstract only
Attribution
- Abstract & full text
- arxiv.org/abs/2512.12730
- TL;DR
- Semantic Scholar
Authors
Ge Zhang, Yujia Qin, Zihan Wang, Zhaoxiang Zhang, Qian Liu, Jian Yang, YunFei Zhao, He Zhu, Kai Hua, Wenhao Huang, Jiaheng Liu, Minghao Liu, Daoguang Zan, Qizhi Chen, Jianpeng Jiao, Chao He, Chenchen Zhang, Xiang Gao, Yong Shan, Xianfu Cheng, Tong Yang, Weihao Xie, Enduo Zhao, Yishuo Yuan, Zaiyuan Wang, Xinjie Chen, Zhaojian Li, Pai Liu, Yue Hou, Jiayu Zhang, Qingshui Gu, Ming Ding, Mingchen Li, Jingzhe Ding, Shengda Long, Hongwan Gao, Weiran Shi, Changxin Pu, Huan Zhou, Fei Hu, Xiaoxu Zhang, Bo Deng, Juntao Lin, Xuanguang Pan, Zifan Peng, Zhewen Tan, Chenyang Zou, Chongyao Tao