What capabilities does SWE-bench test?

SWE-bench evaluates code editing, debugging, tool calling, and planning.

What is the current top score on SWE-bench?

The top reported score is 80.6% by DeepSeek V4 Pro, across 6 models reporting (4 from frontier labs).

How can a model improve its SWE-bench score?

Tools linked to SWE-bench on Sophon include mini-swe-agent-plus, SWE-Gym, Agent Bench RL Env (Prime Community), and SWE RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.

What license is SWE-bench under?

SWE-bench is available under the MIT license.

SWE-bench

Frontier

2,294 real GitHub issues from 12 popular Python repos that require an agent to produce a patch passing the project's test suite.

Open

Publisher: Princeton NLP Group
Capabilities: Code Editing, Debugging, Tool Calling, Planning
Domain: code
Format: Custom
Size: 2,294 tasks
License: MIT
Published: Oct 2023
Notable for: Benchmark for evaluating code editing, debugging and tool calling in the code domain.
Canonical: swebench.com
Also on: huggingface.co/datasets/princeton-nlp/SWE-bench, github.com/SWE-bench/SWE-bench
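
The pass criterion in the description above (a patch counts only if the project's tests pass) can be sketched as a small scoring routine. This is a minimal illustration, not the official harness: the `Instance` class, the `passed_tests` result format, and the instance IDs are hypothetical, and the two test lists are assumed to correspond to the dataset's `FAIL_TO_PASS` and `PASS_TO_PASS` columns.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One benchmark task (hypothetical container for this sketch)."""
    instance_id: str
    fail_to_pass: list[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: list[str]  # tests that already pass and must keep passing


def is_resolved(instance: Instance, passed_tests: set[str]) -> bool:
    """A task is resolved only if every required test is in the passed set."""
    required = instance.fail_to_pass + instance.pass_to_pass
    return all(test in passed_tests for test in required)


def resolved_rate(instances: list[Instance], results: dict[str, set[str]]) -> float:
    """Percentage of tasks resolved; a task with no recorded results counts as unresolved."""
    if not instances:
        return 0.0
    resolved = sum(
        is_resolved(inst, results.get(inst.instance_id, set())) for inst in instances
    )
    return 100.0 * resolved / len(instances)


# Illustrative data only: two made-up tasks, one fully passing.
demo_instances = [
    Instance("demo-1", ["test_bugfix"], ["test_existing"]),
    Instance("demo-2", ["test_other_bugfix"], ["test_existing"]),
]
demo_results = {
    "demo-1": {"test_bugfix", "test_existing"},
    "demo-2": {"test_existing"},  # the fix-specific test still fails
}
print(resolved_rate(demo_instances, demo_results))  # 50.0
```

A leaderboard percentage such as the one reported on this page would be this kind of ratio over the evaluated task set, assuming the reporting harness uses the same all-tests-must-pass rule.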

Attribution

Leaderboard scores: Vault

Top score 80.6% by DeepSeek V4 Pro - 6 models reporting (4 frontier)

Score history

Top models

[Bar chart: SWE-bench scores for 6 models. Highest value: DeepSeek V4 Pro at 80.6.]

Where it's ranked

SWE-bench Leaderboard

Princeton NLP Group

Aggregated

aggregated with 4 others · live

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

mini-swe-agent-plus

Prime Intellect

Verifiers env that runs the mini-swe-agent harness inside Prime Sandboxes against real GitHub issues; reward is test-suite pass.
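
A "reward is test-suite pass" signal can be sketched as a binary function over the exit code of a test command. This is a hedged illustration only, not the mini-swe-agent-plus or Prime Sandboxes implementation: the function name `suite_reward`, its parameters, and the timeout default are assumptions, and a real environment would run the command inside an isolated sandbox rather than on the host.

```python
import subprocess
import sys


def suite_reward(test_cmd: list[str], repo_dir: str, timeout_s: float = 600.0) -> float:
    """Return 1.0 if the test command exits with code 0, else 0.0.

    Hypothetical helper: a timeout is treated the same as a failing suite.
    """
    try:
        proc = subprocess.run(
            test_cmd,
            cwd=repo_dir,
            capture_output=True,  # keep test output out of the caller's stdout
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if proc.returncode == 0 else 0.0


# Stand-in commands: a Python one-liner plays the role of the real test runner.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(suite_reward(passing, "."), suite_reward(failing, "."))
```

In practice `test_cmd` would be the project's own test invocation run after the agent's patch is applied; the all-or-nothing reward mirrors the benchmark's pass/fail scoring.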

Trains toward · RL Env · Code Editing · Debugging · Tool Calling

SWE-Gym

University of California, Berkeley

First open training environment for real-world software-engineering agents - 2,438 Python tasks from 11 repos, each with an executable runtime and a hidden test suite.

Trains toward · RL Env · Code Editing · Debugging · Tool Calling

Agent Bench RL Env (Prime Community)

Prime Community

Benchmarks model performance on SWE-bench in the mini-swe-agent harness.

Trains toward · RL Env · Tool Use · Agent

SWE RL Env (Prime Intellect)

Prime Intellect

SWE tasks (R2E-Gym, SWE-bench, ...).

Trains toward · RL Env · SWE · Code

Swebench PRO RL Env (Prime Intellect)

Prime Intellect

SWE-bench Pro environment backed by Harbor tasks.

Trains toward · RL Env · V1 · SWE · SWE Bench

Papers

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

ICLR · 2024

Introduces SWE-bench, 2,294 real GitHub issues from 12 popular Python repos paired with their merged-PR test suites - a hard agentic coding benchmark.

Contributors

Carlos E. Jimenez, John Yang, Ofir Press

FAQ

What is SWE-bench?: 2,294 real GitHub issues from 12 popular Python repos that require an agent to produce a patch passing the project's test suite.