SWE-bench Leaderboard

Princeton's canonical leaderboard for SWE-bench, SWE-bench Verified, SWE-bench Lite, and SWE-bench Multimodal, ranking coding agents by test-pass rate on real GitHub issues.

Kind: Aggregated
Updates: live · updated 3d ago
Notable for: The defining leaderboard of the 2024-2026 coding-agent wave; every major lab and AI-coding startup reports a SWE-bench Verified number.
Tracks: 5 evals · aggregated


Per-eval breakdown

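51 models

| Model | Organization | Eval 1 | Eval 2 | Eval 3 | Eval 4 | Eval 5 | Average |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 Pro | DeepSeek | 80.6% | - | - | - | - | 80.6% |
| MiniMax M2.5 | Minimax | - | 75.8% | - | - | - | 75.8% |
| Claude Opus 4.5 | Anthropic | - | 79.2% | - | 70.7% | - | 75.0% |
| GPT-5 | OpenAI | 74.9% | 74.4% | - | - | - | 74.7% |
| Gemini 3 Flash | Google (Alphabet Inc.) | - | 75.8% | - | 72.7% | - | 74.3% |
| Gemini 3 Pro | Google (Alphabet Inc.) | 76.2% | 77.4% | - | 68.7% | - | 74.1% |
| Claude Opus 4.6 | Anthropic | - | 75.6% | - | 72.0% | - | 73.8% |
| Claude 4 Opus | Anthropic | - | 73.2% | - | - | - | 73.2% |
| Claude Sonnet 4.5 | Anthropic | 77.2% | 74.8% | - | 67.0% | - | 73.0% |
| GLM 5 | Zai | - | 72.8% | - | 69.7% | - | 71.3% |
| Kimi K2 0905 | Moonshot AI | - | 71.2% | - | - | - | 71.2% |
| Qwen3Coder 480B A35b Instruct | Alibaba | - | 69.6% | - | - | - | 69.6% |
| GPT-5.2 | OpenAI | - | 71.8% | - | 66.7% | - | 69.3% |
| Kimi K2.5 | Kimi | - | 70.8% | - | 67.3% | - | 69.0% |
| GLM 4.6 | Zai | - | 68.2% | - | - | - | 68.2% |
| GPT-5.1 | OpenAI | - | 66.0% | - | - | - | 66.0% |
| GPT-5.1-Codex | OpenAI | - | 66.0% | - | - | - | 66.0% |
| Kimi K2 0711 | Moonshot AI | 65.8% | - | - | - | - | 65.8% |
| Claude 4.5 Haiku | Anthropic | - | 66.6% | - | 64.7% | - | 65.6% |
| Kimi K2 | Moonshot AI | - | 65.4% | - | - | - | 65.4% |
| o1 Preview | OpenAI | - | 64.6% | - | - | - | 64.6% |
| DeepSeek V3.2 | DeepSeek | - | 70.0% | - | 59.0% | - | 64.5% |
| GLM 4.5 | Zai | - | 64.2% | - | - | - | 64.2% |
| Kimi K2 Thinking | Kimi | - | 63.4% | - | - | - | 63.4% |
| MiniMax M2 | Minimax | - | 61.0% | - | - | - | 61.0% |
| Claude 4 Sonnet | Anthropic | - | 74.6% | 58.3% | - | 35.6% | 56.2% |
| o3 | OpenAI | 71.7% | 58.4% | - | - | 36.0% | 55.4% |
| Qwen3 Coder 30B A3B Instruct | Alibaba | - | 60.4% | 49.7% | - | - | 55.0% |
| Gemini 2.5 Pro | Google (Alphabet Inc.) | - | 53.6% | - | - | - | 53.6% |
| GPT-5 Mini | OpenAI | - | 59.8% | - | 39.7% | - | 49.8% |
| o4 Mini | OpenAI | - | 64.6% | - | - | 33.9% | 49.2% |
| Claude Sonnet 3.7 | Anthropic | - | 66.4% | 48.0% | - | 31.3% | 48.6% |
| Qwen2.5 Coder 32B Instruct | Alibaba | - | 47.0% | - | - | - | 47.0% |
| Devstral Small (May '25) | Mistral AI | - | 46.8% | - | - | - | 46.8% |
| Claude Sonnet 3.5 | Anthropic | - | 63.4% | 51.3% | - | 25.3% | 46.7% |
| Gemini 2.0 Flash (Feb '25) | Google (Alphabet Inc.) | - | 44.2% | - | - | - | 44.2% |
| DeepSeek V3 0324 | DeepSeek | - | 42.0% | - | - | - | 42.0% |
| Claude 3 Haiku | Anthropic | - | 40.6% | - | - | - | 40.6% |
| GPT-4o (2024-05-13) | OpenAI | - | 38.8% | 38.0% | - | - | 38.4% |
| o3 Mini | OpenAI | - | 42.4% | 32.3% | - | - | 37.4% |
| DeepSeek V3 | DeepSeek | - | - | 36.7% | - | - | 36.7% |
| GPT-4.1 | OpenAI | - | 39.6% | - | - | 31.1% | 35.4% |
| GPT-5 Nano | OpenAI | - | 34.8% | - | - | - | 34.8% |
| GPT-4o (2024-08-06) | OpenAI | - | 32.6% | 39.7% | - | 30.4% | 34.2% |
| Gemini 2.5 Flash | Google (Alphabet Inc.) | - | 28.7% | - | - | - | 28.7% |
| gpt-oss-120b | OpenAI | - | 26.0% | - | - | - | 26.0% |
| GPT-4 | OpenAI | - | 22.4% | 28.3% | - | - | 25.4% |
| GPT-4.1 Mini | OpenAI | - | 23.9% | - | - | - | 23.9% |
| GPT-4o | OpenAI | - | 21.6% | - | - | - | 21.6% |
| Claude Opus 3 | Anthropic | - | 15.8% | 4.3% | - | - | 10.1% |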

Evals tracked: 5
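
The final figure in each model's row appears consistent with an unweighted mean of whichever per-eval scores that model reports, with missing evals (shown as "-") left out rather than counted as zero. The page does not state this rule, so the sketch below is an assumption checked only against the listed numbers. The function name `aggregate` is hypothetical.

```python
from typing import Optional, Sequence


def aggregate(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Assumed rule: mean of the reported scores, skipping missing ones (None)."""
    reported = [s for s in scores if s is not None]
    if not reported:
        return None  # no eval reported, so nothing to aggregate
    return round(sum(reported) / len(reported), 1)


# Per-eval scores in page order, taken from the rows above; None marks "-".
gemini_3_pro = [76.2, 77.4, None, 68.7, None]
claude_4_sonnet = [None, 74.6, 58.3, None, 35.6]

print(aggregate(gemini_3_pro))     # 74.1
print(aggregate(claude_4_sonnet))  # 56.2
```

Because the displayed per-eval scores are already rounded to one decimal, recomputing the aggregate from them can differ from the listed value in the last digit for some rows.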