debugging
- Slug: debugging
- Evals: 7
- Tools: 16
- Models: 288
- Papers: 6
Evals testing this capability: 7
Tools lifting evals here: 16
Top models on this capability: 288 (by avg parsed score across evals here)
Papers in this area: 6
- introduces: Measuring AI Ability to Complete Long Tasks
- introduces: LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
- introduces: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- introduces: SWE-Gym: An Open Environment for Training Software Engineering Agents and Verifiers
- introduces: Terminal-Bench: A Benchmark for Real-World Terminal-Based Agents
- Introducing Terminal-Bench

