SWE-bench Verified
Frontier
500 human-validated SWE-bench tasks confirmed solvable from the issue alone, with non-flaky test suites - the most-reported agentic coding benchmark.
- Publisher: OpenAI
- Capabilities: Code Editing, Debugging, Tool Calling, Planning
- Domain: code
- Format: Custom
- Size: 500 tasks
- License: MIT
- Published: Oct 2023
- Notable for: Benchmark for evaluating code editing, debugging and tool calling in the code domain.
- Canonical: swebench.com
Top score 79.2% by Claude Opus 4.5 - 48 models reporting (30 frontier)
Score history: 46
Top models: 48
Where it's ranked: 1
Related tools: 6 (implementations, trainers, datasets and scaffolds linked to this eval)
Papers: 1
Contributors: 2
FAQ
- What is SWE-bench Verified?
- SWE-bench Verified is a set of 500 human-validated SWE-bench tasks, each confirmed solvable from the issue alone and backed by a non-flaky test suite. It is the most-reported agentic coding benchmark.
- What capabilities does SWE-bench Verified test?
- SWE-bench Verified evaluates code editing, debugging, tool calling, and planning.
- What is the current top score on SWE-bench Verified?
- The top reported score is 79.2% by Claude Opus 4.5, across 48 models reporting (30 from frontier labs).
- How can a model improve its SWE-bench Verified score?
- Tools linked to SWE-bench Verified on Sophon include mini-swe-agent-plus, SWE-Gym, Agent Bench RL Env (Prime Community), and Deepswe RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
- What license is SWE-bench Verified under?
- SWE-bench Verified is available under the MIT license.
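
To make the top-score figure concrete, here is a minimal sketch that converts between a percentage score and a task count. It assumes the reported score is the plain share of the 500 tasks resolved, with every task weighted equally. The helper names `resolved_count` and `resolve_rate` are hypothetical and not part of any SWE-bench tooling.

```python
TOTAL_TASKS = 500  # size of SWE-bench Verified, as listed above


def resolved_count(score_percent: float, total_tasks: int = TOTAL_TASKS) -> int:
    """Convert a score in percent into a whole number of resolved tasks.

    Assumption: the score is resolved / total, with equal weight per task.
    """
    return round(score_percent / 100 * total_tasks)


def resolve_rate(resolved: int, total_tasks: int = TOTAL_TASKS) -> float:
    """Convert a number of resolved tasks back into a score in percent."""
    return 100 * resolved / total_tasks


top = resolved_count(79.2)  # the top reported score
print(top)                  # 396
print(resolve_rate(top))    # 79.2
print(resolve_rate(1))      # 0.2, the score step for a single task
```

Under that assumption, a 79.2% score corresponds to 396 of the 500 tasks, and each additional resolved task moves the score by 0.2 percentage points.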

