SWE-bench

Frontier

A benchmark of 2,294 real GitHub issues drawn from 12 popular Python repositories. For each issue, an agent must produce a patch that passes the project's test suite.

Domain: code
Format: Custom
Size: 2,294 tasks
License: MIT
Published: Oct 2023
Notable for: Benchmark for evaluating code editing, debugging and tool calling in the code domain.
Canonical: swebench.com

Leaderboard scores

Top score: 80.6% by DeepSeek V4 Pro. 6 models reporting (4 from frontier labs).

Score history

[Chart: score history for 6 models, Apr 25 to Apr 26, y-axis 60% to 100%. Labeled models: o3, GPT-5, Claude Sonnet 4.5, DeepSeek V4 Pro.]

Top models

[Bar chart: 6 models. Highest value: DeepSeek V4 Pro at 80.6.]

Related tools

5
Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is SWE-bench?
SWE-bench is a benchmark of 2,294 real GitHub issues from 12 popular Python repos. For each issue, an agent must produce a patch that passes the project's test suite.
What capabilities does SWE-bench test?
SWE-bench evaluates code editing, debugging, tool calling, and planning.
What is the current top score on SWE-bench?
The top reported score is 80.6% by DeepSeek V4 Pro, across 6 models reporting (4 from frontier labs).
How can a model improve its SWE-bench score?
Tools linked to SWE-bench on Sophon include mini-swe-agent-plus, SWE-Gym, Agent Bench RL Env (Prime Community), and SWE RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
What license is SWE-bench under?
SWE-bench is released under the MIT license.
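
As a rough illustration of how a percentage score like the ones reported here could be derived, the sketch below assumes the score is simply the share of tasks whose patch passes the project's tests. The `TaskResult` record, its fields, and the sample task IDs are hypothetical and are not part of any SWE-bench tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one task (hypothetical record, for illustration only)."""
    task_id: str
    tests_passed: bool  # True if the agent's patch passed the project's test suite


def resolved_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks whose patch passed the tests."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(1 for r in results if r.tests_passed)
    return 100.0 * resolved / len(results)


# Made-up sample data: 3 of 4 tasks resolved.
sample = [
    TaskResult("task-1", True),
    TaskResult("task-2", False),
    TaskResult("task-3", True),
    TaskResult("task-4", True),
]
print(f"{resolved_rate(sample):.1f}%")  # 75.0%
```

Under this assumption, on a 2,294-task set each resolved task moves the score by about 0.04 percentage points.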