SWE-bench Verified

Frontier

500 human-validated SWE-bench tasks, each confirmed solvable from the issue alone and backed by a non-flaky test suite. It is the most-reported agentic coding benchmark.

Open
Publisher: OpenAI
Domain: code
Format: Custom
Size: 500 tasks
License: MIT
Published: Oct 2023
Notable for: Benchmark for evaluating code editing, debugging and tool calling in the code domain.
Canonical: swebench.com

Attribution

Leaderboard scores
SWE-bench

Top score: 79.2% by Claude Opus 4.5, with 48 models reporting (30 frontier).
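To make the headline number concrete, here is a minimal Python sketch relating a percentage score to a task count. It assumes the score is the plain fraction of the 500 tasks that were resolved; the listing does not state how each reported score was computed, and some submissions may use a different denominator. `TaskResult`, `resolved_rate`, and `tasks_for_score` are hypothetical names for illustration, not part of any official harness.

```python
from dataclasses import dataclass

TOTAL_TASKS = 500  # benchmark size as stated in the listing


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical record, not an official schema)."""
    instance_id: str
    resolved: bool


def resolved_rate(results: list[TaskResult], total: int = TOTAL_TASKS) -> float:
    """Percentage of tasks resolved; tasks without a result count as unresolved."""
    if total <= 0:
        raise ValueError("total must be positive")
    if len(results) > total:
        raise ValueError("more results than tasks")
    resolved = sum(1 for r in results if r.resolved)
    return 100.0 * resolved / total


def tasks_for_score(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Approximate number of resolved tasks implied by a percentage score."""
    return round(score_percent / 100.0 * total)


if __name__ == "__main__":
    # Under the assumption above, 79.2% corresponds to 396 of 500 tasks.
    print(tasks_for_score(79.2))
```

Under this assumption, a single task is worth 0.2 percentage points, so small gaps between leaderboard entries correspond to only a handful of tasks.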

Score history

[Chart: score history. Y-axis 0% to 100%; x-axis Mar 23 to Nov 25. Labeled models: GPT-4, Claude 3 Haiku, o1 Preview, Claude Sonnet 3.7, Claude 4 Sonnet, Claude Sonnet 4.5, Claude Opus 4.5.]

Top models

[Bar chart: SWE-bench Verified, 21 models. Highest value: Claude Opus 4.5 at 79.2.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is SWE-bench Verified?
500 human-validated SWE-bench tasks, each confirmed solvable from the issue alone and backed by a non-flaky test suite. It is the most-reported agentic coding benchmark.
What capabilities does SWE-bench Verified test?
SWE-bench Verified evaluates code editing, debugging, tool calling, and planning.
What is the current top score on SWE-bench Verified?
The top reported score is 79.2% by Claude Opus 4.5, across 48 models reporting (30 from frontier labs).
How can a model improve its SWE-bench Verified score?
Tools linked to SWE-bench Verified on Sophon include mini-swe-agent-plus, SWE-Gym, Agent Bench RL Env (Prime Community), and Deepswe RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
What license is SWE-bench Verified under?
SWE-bench Verified is available under the MIT license.