What capabilities does SWE-bench Verified test?

SWE-bench Verified evaluates code editing, debugging, tool calling, and planning.

What is the current top score on SWE-bench Verified?

The top reported score is 79.2% by Claude Opus 4.5, across 48 models reporting (30 from frontier labs).

How can a model improve its SWE-bench Verified score?

Tools linked to SWE-bench Verified on Sophon include mini-swe-agent-plus, SWE-Gym, Agent Bench RL Env (Prime Community), and Deepswe RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.

What license is SWE-bench Verified under?

SWE-bench Verified is available under the MIT license.

SWE-bench Verified

Frontier

500 human-validated SWE-bench tasks, each confirmed solvable from the issue text alone and backed by a non-flaky test suite. It is the most-reported agentic coding benchmark.

Open

Publisher: OpenAI
Capabilities: Code Editing, Debugging, Tool Calling, Planning
Domain: code
Format: Custom
Size: 500 tasks
License: MIT
Published: Aug 2024 (original SWE-bench: Oct 2023)
Notable for: Benchmark for evaluating code editing, debugging and tool calling in the code domain.
Canonical: swebench.com
Also on: openai.com/index/introducing-swe-bench-verified, huggingface.co/datasets/princeton-nlp/SWE-bench_Verified

Attribution

Leaderboard scores: SWE-bench

Top score 79.2% by Claude Opus 4.5 - 48 models reporting (30 frontier)
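As a rough illustration of what the headline figure means, the sketch below converts between a resolve rate and a task count on the 500-task set. It assumes the score is a plain fraction of resolved tasks over all 500 from a single run, which the page does not state; the function names `resolve_rate` and `implied_resolved` are hypothetical.

```python
TOTAL_TASKS = 500  # size of SWE-bench Verified, as listed on this page


def resolve_rate(resolved: int, total: int = TOTAL_TASKS) -> float:
    """Percentage of tasks resolved, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return round(100.0 * resolved / total, 1)


def implied_resolved(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Task count implied by a percentage score.

    Assumption: the score is resolved / total for one run, not an average
    over several runs.
    """
    return round(score_percent / 100.0 * total)


if __name__ == "__main__":
    print(implied_resolved(79.2))  # 396
    print(resolve_rate(396))       # 79.2
```

Under that assumption, 79.2% corresponds to 396 of the 500 tasks, and each additional resolved task moves the score by 0.2 percentage points.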

Top models

[Bar chart: SWE-bench Verified scores for 21 models; highest value is Claude Opus 4.5 at 79.2.]

Where it's ranked

SWE-bench Leaderboard

Princeton NLP Group

Aggregated

aggregated with 4 others · live

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

mini-swe-agent-plus

Prime Intellect

Verifiers env that runs the mini-swe-agent harness inside Prime Sandboxes against real GitHub issues; reward is test-suite pass.

Trains toward · RL Env · Code Editing · Debugging · Tool Calling
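The "reward is test-suite pass" description above can be sketched as a binary reward function. This is a minimal illustration, not the environment's actual code: the function name `binary_reward` and the `results` mapping are hypothetical, and the split into `fail_to_pass` and `pass_to_pass` test lists is an assumption borrowed from SWE-bench's resolution criterion (tests the fix must make pass, plus tests that must keep passing).

```python
from typing import Iterable, Mapping


def binary_reward(
    results: Mapping[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> float:
    """Return 1.0 only if every required test passed after the patch.

    results maps a test name to whether it passed (hypothetical format).
    A required test missing from results is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    if not required:
        # Nothing to verify, so no reward is given.
        return 0.0
    all_passed = all(results.get(name, False) for name in required)
    return 1.0 if all_passed else 0.0


if __name__ == "__main__":
    outcome = {"test_fix": True, "test_existing": True}
    print(binary_reward(outcome, ["test_fix"], ["test_existing"]))
```

A binary signal like this is sparse: a patch that fixes the issue but breaks one unrelated test earns the same reward as no patch at all, which is the trade-off of rewarding only a full test-suite pass.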

SWE-Gym

University of California, Berkeley

First open training environment for real-world software-engineering agents - 2,438 Python tasks from 11 repos, each with an executable runtime and a hidden test suite.

Trains toward · RL Env · Code Editing · Debugging · Tool Calling

Agent Bench RL Env (Prime Community)

Prime Community

Benchmarks model performance on SWE-bench in the Mini SWE Agent harness.

Trains toward · RL Env · Tool Use · Agent

Deepswe RL Env (Prime Intellect)

Prime Intellect

DeepSWE environment for solving SWE issues inside Prime Sandboxes.

Trains toward · RL Env · SWE · Code

Agent PLUS RL Env (Prime Intellect)

Prime Intellect

Mini SWE Agent Plus environment for solving SWE issues inside Prime Sandboxes.

Trains toward · RL Env · SWE · Code

Opencode SWE RL Env (Prime Intellect)

Prime Intellect

OpenCode SWE environment for solving SWE issues inside Prime Sandboxes.

Trains toward · RL Env · SWE · Code

Papers

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

ICLR · 2024

Introduces SWE-bench, 2,294 real GitHub issues from 12 popular Python repos paired with their merged-PR test suites - a hard agentic coding benchmark.

introduces

Contributors

Carlos E. Jimenez, John Yang
