SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Introduces SWE-bench, a hard agentic coding benchmark of 2,294 real GitHub issues from 12 popular Python repos, each paired with the test suite from its merged PR.
- Publisher: Princeton NLP Group
- Year: 2024
- Venue: ICLR
- Authors: 7
- Hosting: External source, license unknown
Introduces 3 artifacts - 3 evals
TL;DR (Semantic Scholar)
SWE-bench is an evaluation framework of software engineering problems drawn from real GitHub issues and their corresponding pull requests across popular Python repositories. Results on it show that both state-of-the-art proprietary models and the fine-tuned model SWE-Llama can resolve only the simplest issues.