What capabilities does SWE-bench Lite test?

SWE-bench Lite evaluates code editing, debugging, and tool calling.

What is the current top score on SWE-bench Lite?

The top reported score is 58.3% by Claude 4 Sonnet, across 11 models reporting (10 from frontier labs).
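To put the percentage in terms of tasks, the sketch below converts between a score and a count of resolved issues. It assumes the score is the share of the 300 Lite tasks that were resolved, which this page does not state explicitly; the function names `resolved_rate` and `tasks_for_score` are hypothetical helpers for illustration.

```python
def resolved_rate(resolved: int, total: int = 300) -> float:
    """Percentage of tasks resolved out of `total` (300 for SWE-bench Lite)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


def tasks_for_score(score_pct: float, total: int = 300) -> int:
    """Nearest whole number of resolved tasks implied by a percentage score.

    Assumption: the score is a plain resolved/total percentage.
    """
    return round(score_pct * total / 100.0)
```

Under that assumption, `tasks_for_score(58.3)` gives 175, so the top score corresponds to about 175 of 300 issues, and each additional resolved issue moves the score by roughly a third of a percentage point.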

How can a model improve its SWE-bench Lite score?

Tools linked to SWE-bench Lite on Sophon include SWE-Gym, Agent Bench RL Env (Prime Community), Deepswe RL Env (Prime Intellect), and Agent PLUS RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.

What license is SWE-bench Lite under?

SWE-bench Lite is available under the MIT license.

SWE-bench Lite

Frontier

A 300-issue subset of SWE-bench focused on functional bug fixes that are easier to evaluate. It is used for fast iteration before full SWE-bench runs.

Open

Publisher: Princeton University
Capabilities: Code Editing, Debugging, Tool Calling
Domain: code
Format: Custom
Size: 300 tasks
License: MIT
Published: Oct 2023
Notable for: Benchmark for evaluating code editing, debugging and tool calling in the code domain.
Canonical: swebench.com/lite.html
Also on: huggingface.co/datasets/princeton-nlp/SWE-bench_Lite
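A minimal sketch of pulling the Hugging Face copy for a quick look, assuming the `datasets` package is installed, that the 300 tasks live in a split named `test`, and that each row carries a `repo` field. The split and field names are assumptions taken from the dataset card rather than from this page, and `load_lite` and `tasks_per_repo` are hypothetical helper names.

```python
from collections import Counter
from typing import Iterable, Mapping


def tasks_per_repo(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count tasks per source repository, keyed by each row's "repo" field."""
    return Counter(row["repo"] for row in rows)


def load_lite(split: str = "test"):
    """Fetch SWE-bench Lite from the Hugging Face Hub (needs network access).

    Assumption: the split name "test" matches the dataset card.
    """
    # Imported lazily so tasks_per_repo works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("princeton-nlp/SWE-bench_Lite", split=split)
```

Calling `tasks_per_repo(load_lite())` would then show how the tasks are spread across repositories, which is a cheap sanity check before running a full evaluation.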

Attribution

Leaderboard scores: SWE-bench

Top score 58.3% by Claude 4 Sonnet - 11 models reporting (10 frontier)

Top models (bar chart, 11 models): highest value is Claude 4 Sonnet at 58.3.

Where it's ranked

SWE-bench Leaderboard

Princeton NLP Group

Aggregated

aggregated with 4 others · live

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

SWE-Gym

University of California, Berkeley

First open training environment for real-world software-engineering agents: 2,438 Python tasks from 11 repos, each with an executable runtime and a hidden test suite.

Trains toward · RL Env · Code Editing · Tool Calling · Debugging

Agent Bench RL Env (Prime Community)

Prime Community

Benchmarking model performance on SWE Bench in the Mini SWE Agent harness.

Trains toward · RL Env · Tool Use · Agent

Deepswe RL Env (Prime Intellect)

Prime Intellect

DeepSWE environment for solving SWE issues inside Prime Sandboxes.

Trains toward · RL Env · SWE · Code

Agent PLUS RL Env (Prime Intellect)

Prime Intellect

Mini SWE Agent Plus environment for solving SWE issues inside Prime Sandboxes.

Trains toward · RL Env · SWE · Code

Papers

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

ICLR · 2024

Introduces SWE-bench, a hard agentic coding benchmark of 2,294 real GitHub issues from 12 popular Python repos, each paired with the test suite from its merged PR.

introduces
