SWE-bench Leaderboard
Princeton's canonical leaderboard for SWE-bench, SWE-bench Verified, SWE-bench Lite, and SWE-bench Multimodal, ranking coding agents by test-pass rate on real GitHub issues.
- Operator: Princeton NLP Group
- Kind: Aggregated
- Updates: live · updated 3d ago
- Notable for: The defining leaderboard of the 2024-2026 coding-agent wave; every major lab and AI-coding startup reports a SWE-bench Verified number.
- URL: swebench.com
- Tracks: 5 evals · aggregated
Per-eval breakdown
51 models
| Model | Eval 1 | Eval 2 | Eval 3 | Eval 4 | Eval 5 | Mean |
|---|---|---|---|---|---|---|
| DeepSeek V4 Pro DeepSeek | 80.6% | - | - | - | - | 80.6% |
| MiniMax M2.5 Minimax | - | 75.8% | - | - | - | 75.8% |
| Claude Opus 4.5 Anthropic | - | 79.2% | - | 70.7% | - | 75.0% |
| GPT-5 OpenAI | 74.9% | 74.4% | - | - | - | 74.7% |
| Gemini 3 Flash Google (Alphabet Inc.) | - | 75.8% | - | 72.7% | - | 74.3% |
| Gemini 3 Pro Google (Alphabet Inc.) | 76.2% | 77.4% | - | 68.7% | - | 74.1% |
| Claude Opus 4.6 Anthropic | - | 75.6% | - | 72.0% | - | 73.8% |
| Claude 4 Opus Anthropic | - | 73.2% | - | - | - | 73.2% |
| Claude Sonnet 4.5 Anthropic | 77.2% | 74.8% | - | 67.0% | - | 73.0% |
| GLM 5 Zai | - | 72.8% | - | 69.7% | - | 71.3% |
| Kimi K2 0905 Moonshot AI | - | 71.2% | - | - | - | 71.2% |
| Qwen3Coder 480B A35b Instruct Alibaba | - | 69.6% | - | - | - | 69.6% |
| GPT-5.2 OpenAI | - | 71.8% | - | 66.7% | - | 69.3% |
| Kimi K2.5 Kimi | - | 70.8% | - | 67.3% | - | 69.0% |
| GLM 4.6 Zai | - | 68.2% | - | - | - | 68.2% |
| GPT-5.1 OpenAI | - | 66.0% | - | - | - | 66.0% |
| GPT-5.1-Codex OpenAI | - | 66.0% | - | - | - | 66.0% |
| Kimi K2 0711 Moonshot AI | 65.8% | - | - | - | - | 65.8% |
| Claude 4.5 Haiku Anthropic | - | 66.6% | - | 64.7% | - | 65.6% |
| Kimi K2 Moonshot AI | - | 65.4% | - | - | - | 65.4% |
| o1 Preview OpenAI | - | 64.6% | - | - | - | 64.6% |
| DeepSeek V3.2 DeepSeek | - | 70.0% | - | 59.0% | - | 64.5% |
| GLM 4.5 Zai | - | 64.2% | - | - | - | 64.2% |
| Kimi K2 Thinking Kimi | - | 63.4% | - | - | - | 63.4% |
| MiniMax M2 Minimax | - | 61.0% | - | - | - | 61.0% |
| Claude 4 Sonnet Anthropic | - | 74.6% | 58.3% | - | 35.6% | 56.2% |
| o3 OpenAI | 71.7% | 58.4% | - | - | 36.0% | 55.4% |
| Qwen3 Coder 30B A3B Instruct Alibaba | - | 60.4% | 49.7% | - | - | 55.0% |
| Gemini 2.5 Pro Google (Alphabet Inc.) | - | 53.6% | - | - | - | 53.6% |
| GPT-5 Mini OpenAI | - | 59.8% | - | 39.7% | - | 49.8% |
| o4 Mini OpenAI | - | 64.6% | - | - | 33.9% | 49.2% |
| Claude Sonnet 3.7 Anthropic | - | 66.4% | 48.0% | - | 31.3% | 48.6% |
| Qwen2.5 Coder 32B Instruct Alibaba | - | 47.0% | - | - | - | 47.0% |
| Devstral Small (May '25) Mistral AI | - | 46.8% | - | - | - | 46.8% |
| Claude Sonnet 3.5 Anthropic | - | 63.4% | 51.3% | - | 25.3% | 46.7% |
| Gemini 2.0 Flash (Feb '25) Google (Alphabet Inc.) | - | 44.2% | - | - | - | 44.2% |
| DeepSeek V3 0324 DeepSeek | - | 42.0% | - | - | - | 42.0% |
| Claude 3 Haiku Anthropic | - | 40.6% | - | - | - | 40.6% |
| GPT-4o (2024-05-13) OpenAI | - | 38.8% | 38.0% | - | - | 38.4% |
| o3 Mini OpenAI | - | 42.4% | 32.3% | - | - | 37.4% |
| DeepSeek V3 DeepSeek | - | - | 36.7% | - | - | 36.7% |
| GPT-4.1 OpenAI | - | 39.6% | - | - | 31.1% | 35.4% |
| GPT-5 Nano OpenAI | - | 34.8% | - | - | - | 34.8% |
| GPT-4o (2024-08-06) OpenAI | - | 32.6% | 39.7% | - | 30.4% | 34.2% |
| Gemini 2.5 Flash Google (Alphabet Inc.) | - | 28.7% | - | - | - | 28.7% |
| gpt-oss-120b OpenAI | - | 26.0% | - | - | - | 26.0% |
| GPT-4 OpenAI | - | 22.4% | 28.3% | - | - | 25.4% |
| GPT-4.1 Mini OpenAI | - | 23.9% | - | - | - | 23.9% |
| GPT-4o OpenAI | - | 21.6% | - | - | - | 21.6% |
| Claude Opus 3 Anthropic | - | 15.8% | 4.3% | - | - | 10.1% |
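The last column of the table appears to be the unweighted mean of whichever per-eval scores a model has reported, with `-` cells skipped. The page does not state this; it is inferred from the rows. A minimal Python sketch of that reading follows. The function names `aggregate_score` and `matches_table` are hypothetical, and the tolerance is an assumption: rows whose mean lands exactly on a half-tenth (for example 69.05) are listed rounded either way, so the check compares within a tolerance instead of assuming a rounding mode.

```python
from statistics import mean
from typing import Optional, Sequence


def aggregate_score(per_eval: Sequence[Optional[float]]) -> Optional[float]:
    """Unweighted mean of the reported scores; None stands for a '-' cell."""
    reported = [score for score in per_eval if score is not None]
    return mean(reported) if reported else None


def matches_table(per_eval: Sequence[Optional[float]],
                  listed: float,
                  tol: float = 0.051) -> bool:
    """True if the recomputed mean agrees with the listed value to one decimal.

    tol is an assumed tolerance: half a tenth plus a little floating-point slack.
    """
    recomputed = aggregate_score(per_eval)
    return recomputed is not None and abs(recomputed - listed) <= tol


# Rows copied from the table above (five per-eval cells, then the last column).
gemini_3_pro = [76.2, 77.4, None, 68.7, None]
kimi_k2_5 = [None, 70.8, None, 67.3, None]

print(matches_table(gemini_3_pro, 74.1))
print(matches_table(kimi_k2_5, 69.0))
```

Because the mean is taken only over reported evals, models with a single score are ranked alongside models averaged over three, so the last column is not a like-for-like comparison across rows.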