EvalPlus Leaderboard
The EvalPlus team's joint HumanEval+ / MBPP+ leaderboard ranks LLMs on code-generation benchmarks expanded with 35-80× more test cases (80× for HumanEval+, 35× for MBPP+).
- Operator: EvalPlus Team
- Kind: Aggregated
- Updates: monthly (updated 3d ago)
- Notable for: The reference leaderboard for "did your HumanEval/MBPP score survive the +-test expansion"; the canonical correction for inflated code-gen numbers from the original benchmarks.
- Tracks: 4 evals, aggregated
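One way to read "did your score survive the +-test expansion" is as the gap, in percentage points, between a model's score on the original benchmark and its score on the "+" variant. The sketch below is illustrative only: `plus_drop` is a hypothetical helper, and the numbers come from the Qwen2.5 Coder 32B Instruct row on this page under the assumption that its four scores are ordered HumanEval, MBPP, HumanEval+, MBPP+.

```python
def plus_drop(base: float, plus: float) -> float:
    """Percentage points lost going from a base benchmark to its "+" variant."""
    return round(base - plus, 1)


# Assumed column order: HumanEval, MBPP, HumanEval+, MBPP+ (not stated on the page).
scores = {"HumanEval": 92.1, "MBPP": 90.5, "HumanEval+": 87.2, "MBPP+": 77.0}

drops = {
    "HumanEval": plus_drop(scores["HumanEval"], scores["HumanEval+"]),
    "MBPP": plus_drop(scores["MBPP"], scores["MBPP+"]),
}
print(drops)  # {'HumanEval': 4.9, 'MBPP': 13.5}
```

A larger drop suggests that more of the model's original passes depended on the thinner original test suites.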
Per-eval breakdown
19 models
| Model (organization) | HumanEval | MBPP | HumanEval+ | MBPP+ | Average |
|---|---|---|---|---|---|
| GPT-5 Nano OpenAI | - | 100.0% | - | - | 100.0% |
| Qwen2.5 Coder 32B Instruct Alibaba | 92.1% | 90.5% | 87.2% | 77.0% | 86.7% |
| Gemini 1.5 Pro 002 Google (Alphabet Inc.) | 89.0% | 89.7% | 79.3% | 74.6% | 83.2% |
| Grok Beta xAI | 88.4% | 86.0% | 80.5% | 65.6% | 80.1% |
| GPT-4o-mini OpenAI | - | 78.0% | - | - | 78.0% |
| Gemini 1.5 Flash 002 Google (Alphabet Inc.) | 82.3% | 84.7% | 75.6% | 67.5% | 77.5% |
| Llama 3 Instruct 70B Meta Platforms | 77.4% | 82.3% | 72.0% | 69.0% | 75.2% |
| CodeLlama 70B Instruct Meta Platforms | 72.0% | - | 65.9% | - | 69.0% |
| Qwen1.5 72B Chat Alibaba | 68.3% | 72.5% | 59.1% | 61.6% | 65.4% |
| Command R Cohere | 64.0% | 74.3% | 56.7% | 63.5% | 64.6% |
| Llama 3.1 8B Instruct Meta Platforms | 69.5% | 68.3% | 62.8% | 55.6% | 64.0% |
| Phi 3 Mini 4k Instruct Microsoft | 64.6% | 65.9% | 59.1% | 54.2% | 60.9% |
| Llama 3 8B Instruct Meta Platforms | 61.6% | 64.6% | 56.7% | 54.8% | 59.4% |
| Gemma 1.1 7B It Google (Alphabet Inc.) | 42.7% | 57.1% | 35.4% | 45.0% | 45.1% |
| Gemma 7B It Google (Alphabet Inc.) | 28.7% | 47.1% | 25.0% | 36.8% | 34.4% |
| Gemma 1.1 2B It Google (Alphabet Inc.) | 22.6% | 29.8% | 17.7% | 23.3% | 23.4% |
| Vicuna 13B LMSYS Org | 17.1% | - | 15.9% | - | 16.5% |
| Gemma 2B It Google (Alphabet Inc.) | 17.7% | - | 15.2% | - | 16.4% |
| Vicuna 7B | 11.6% | - | 11.6% | - | 11.6% |
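
The final value in each row appears to be the arithmetic mean of whichever per-eval scores are present, with missing entries (`-`) skipped rather than counted as zero. The page does not document this rule, so the sketch below is an inference from the numbers; `mean_available` is a hypothetical helper, and `None` stands in for `-`.

```python
from typing import Optional, Sequence


def mean_available(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the scores that are present, rounded to one decimal place."""
    present = [s for s in scores if s is not None]
    if not present:
        return None  # no evals reported for this model
    return round(sum(present) / len(present), 1)


# Per-eval scores copied from three rows of the table; None marks a "-" cell.
rows = {
    "Qwen2.5 Coder 32B Instruct": [92.1, 90.5, 87.2, 77.0],
    "Vicuna 13B": [17.1, None, 15.9, None],
    "GPT-4o-mini": [None, 78.0, None, None],
}

averages = {model: mean_available(scores) for model, scores in rows.items()}
print(averages)
```

Because missing evals are skipped, a model with a single reported score (such as the 100.0% entry at the top) is ranked on that score alone, so averages are not directly comparable across rows with different coverage.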