SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Introduces SWE-bench, a hard agentic coding benchmark of 2,294 real GitHub issues from 12 popular Python repositories, each paired with the test suite from its merged pull request.

Publisher: Princeton NLP Group
Year: 2024
Venue: ICLR
ArXiv: arxiv.org/abs/2310.06770
Code: github.com/princeton-nlp/SWE-bench
Authors: 7
Hosting: External source, license unknown

Attribution

Abstract & full text: arxiv.org/abs/2310.06770
TL;DR: semanticscholar.org/paper/94a5f96308729e31c1ffbc0f0618db87795092fe
Code: github.com/princeton-nlp/SWE-bench

Introduces 3 artifacts - 3 evals

TL;DR

Semantic Scholar

Introduces SWE-bench, an evaluation framework of software engineering problems drawn from real GitHub issues and their corresponding pull requests across popular Python repositories. Evaluation shows that both state-of-the-art proprietary models and the fine-tuned model SWE-Llama can resolve only the simplest issues.

Artifacts

Evals

SWE-bench Verified, SWE-bench, SWE-bench Lite

Authors

Alexander Wettig, Carlos E. Jimenez, John Yang, Karthik Narasimhan, Kexin Pei, Ofir Press, Shunyu Yao