SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Introduces SWE-bench, a benchmark of 2,294 real GitHub issues from 12 popular Python repositories, each paired with the test suite from its merged pull request. It is a hard agentic coding benchmark.

Year: 2024
Venue: ICLR
Authors: 7
Hosting: External source, license unknown

Introduces 3 artifacts - 3 evals

TL;DR

Semantic Scholar

SWE-bench is an evaluation framework consisting of software engineering problems drawn from real GitHub issues and their corresponding pull requests across popular Python repositories. Evaluations on it show that both state-of-the-art proprietary models and the fine-tuned model SWE-Llama can resolve only the simplest issues.
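
To make "resolve" concrete, here is a minimal sketch of test-based scoring. It assumes an issue counts as resolved only when the tests taken from the merged pull request pass after the model's patch is applied. The `TaskInstance` schema, the field names `fail_to_pass` and `pass_to_pass`, and both helper functions are hypothetical names chosen for illustration, not the benchmark's actual harness.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """One benchmark task (hypothetical schema, for illustration only)."""
    instance_id: str
    # Tests expected to fail before the fix and pass after it (assumed field).
    fail_to_pass: list[str]
    # Tests that passed before the fix and must keep passing (assumed field).
    pass_to_pass: list[str] = field(default_factory=list)


def is_resolved(task: TaskInstance, passed_tests: set[str]) -> bool:
    """Return True only if every required test passed after the patch was applied."""
    required = task.fail_to_pass + task.pass_to_pass
    return all(name in passed_tests for name in required)


def resolved_rate(tasks: list[TaskInstance], results: dict[str, set[str]]) -> float:
    """Fraction of tasks resolved; `results` maps instance_id to the set of passing tests."""
    if not tasks:
        return 0.0
    # A task with no recorded results is treated as unresolved.
    resolved = sum(is_resolved(t, results.get(t.instance_id, set())) for t in tasks)
    return resolved / len(tasks)
```

Under this all-or-nothing rule a patch that fixes the issue but breaks an unrelated test is not counted as resolving it.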
