SWE-bench
Frontier
2,294 real GitHub issues drawn from 12 popular Python repositories. For each issue, an agent must produce a patch that passes the project's test suite.
- Publisher: Princeton NLP Group
- Capabilities: Code Editing, Debugging, Tool Calling, Planning
- Domain: code
- Format: Custom
- Size: 2,294 tasks
- License: MIT
- Published: Oct 2023
- Notable for: Benchmark for evaluating code editing, debugging and tool calling in the code domain.
- Canonical: swebench.com
Top score 80.6% by DeepSeek V4 Pro - 6 models reporting (4 frontier)
Score history (6)
Top models (6)
Where it's ranked (1)
Related tools (5): Implementations, trainers, datasets and scaffolds linked to this eval.
Papers (2)
Contributors (3)
FAQ
- What is SWE-bench?
- SWE-bench is a benchmark of 2,294 real GitHub issues drawn from 12 popular Python repositories. For each issue, an agent must produce a patch that passes the project's test suite.
- What capabilities does SWE-bench test?
- SWE-bench evaluates code editing, debugging, tool calling, and planning.
- What is the current top score on SWE-bench?
- The top reported score is 80.6% by DeepSeek V4 Pro, across 6 models reporting (4 from frontier labs).
- How can a model improve its SWE-bench score?
- Tools linked to SWE-bench on Sophon include mini-swe-agent-plus, SWE-Gym, Agent Bench RL Env (Prime Community), and SWE RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
- What license is SWE-bench under?
- SWE-bench is available under the MIT license.
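
To make the percentage scores on this page concrete, here is a minimal sketch of how such a score could be computed, assuming it is the share of tasks whose generated patch passes the project's tests. The `TaskResult` record, the `resolved_rate` helper, and the task IDs are hypothetical illustrations, not the official SWE-bench harness or its data schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one task (hypothetical record, not the official schema)."""
    task_id: str
    tests_passed: bool  # True if the patched repository passed the task's tests


def resolved_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks whose patch passed the tests."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(1 for r in results if r.tests_passed)
    return 100.0 * resolved / len(results)


# Toy data for illustration only; these are not real benchmark results.
results = [
    TaskResult("repo-a__issue-1", True),
    TaskResult("repo-a__issue-2", False),
    TaskResult("repo-b__issue-7", True),
    TaskResult("repo-c__issue-3", True),
]

score = resolved_rate(results)
print(f"{score:.1f}% resolved")  # 3 of 4 toy tasks pass
```

Under this assumption, every task counts equally, so a model's score moves only when a whole task flips from failing to passing.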
