EvalPlus Leaderboard

Name: EvalPlus Leaderboard
Creator: EvalPlus Team

The EvalPlus team's joint HumanEval+ / MBPP+ leaderboard, which ranks LLMs on code-generation benchmarks whose test suites were expanded with 35-80× more test cases.

Operator: EvalPlus Team
Kind: Aggregated
Updates: monthly · updated 1mo ago
Notable for: The reference leaderboard for "did your HumanEval/MBPP score survive the expanded '+' test suites": the canonical correction for inflated code-gen numbers from the original benchmarks.
URL: evalplus.github.io/leaderboard.html
Tracks: 4 evals · aggregated
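
The effect of expanding a benchmark's test suite can be illustrated with a toy example. This is only a sketch: the function, the tests, and the names `buggy_median`, `base_tests`, and `plus_tests` are hypothetical and are not taken from HumanEval, MBPP, or the EvalPlus tooling. It shows a plausible-looking solution that passes a small base suite but fails once more edge cases are added.

```python
def buggy_median(xs):
    """Hypothetical model output: ignores the even-length case."""
    s = sorted(xs)
    return s[len(s) // 2]


# Hypothetical base suite: only odd-length inputs, so the bug stays hidden.
base_tests = [([1, 3, 2], 2), ([5], 5)]

# Hypothetical expanded suite: the base tests plus even-length edge cases.
plus_tests = base_tests + [([1, 2, 3, 4], 2.5), ([4, 1], 2.5)]


def passes(func, tests):
    """Return True only if func gives the expected answer on every test."""
    return all(func(list(args)) == expected for args, expected in tests)


passed_base = passes(buggy_median, base_tests)
passed_plus = passes(buggy_median, plus_tests)
print(passed_base, passed_plus)
```

A solution counted as correct under the base suite is counted as wrong under the expanded one, which is why a model's score on an expanded benchmark can only stay the same or drop.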

Per-eval breakdown

Model	Developer	↗	↗	↗	↗	Average
GPT-5 Nano	OpenAI	-	100.0%	-	-	100.0%
Qwen2.5 Coder 32B Instruct	Alibaba	92.1%	90.5%	87.2%	77.0%	86.7%
Gemini 1.5 Pro 002	Google (Alphabet Inc.)	89.0%	89.7%	79.3%	74.6%	83.2%
Grok Beta	xAI	88.4%	86.0%	80.5%	65.6%	80.1%
GPT-4o-mini	OpenAI	-	78.0%	-	-	78.0%
Gemini 1.5 Flash 002	Google (Alphabet Inc.)	82.3%	84.7%	75.6%	67.5%	77.5%
Llama 3 Instruct 70B	Meta Platforms	77.4%	82.3%	72.0%	69.0%	75.2%
CodeLlama 70B Instruct	Meta Platforms	72.0%	-	65.9%	-	69.0%
Qwen1.5 72B Chat	Alibaba	68.3%	72.5%	59.1%	61.6%	65.4%
Command R	Cohere	64.0%	74.3%	56.7%	63.5%	64.6%
Llama 3.1 8B Instruct	Meta Platforms	69.5%	68.3%	62.8%	55.6%	64.0%
Phi 3 Mini 4k Instruct	Microsoft	64.6%	65.9%	59.1%	54.2%	60.9%
Llama 3 8B Instruct	Meta Platforms	61.6%	64.6%	56.7%	54.8%	59.4%
Gemma 1.1 7B It	Google (Alphabet Inc.)	42.7%	57.1%	35.4%	45.0%	45.1%
Gemma 7B It	Google (Alphabet Inc.)	28.7%	47.1%	25.0%	36.8%	34.4%
Gemma 1.1 2B It	Google (Alphabet Inc.)	22.6%	29.8%	17.7%	23.3%	23.4%
Vicuna 13B	LMSYS Org	17.1%	-	15.9%	-	16.5%
Gemma 2B It	Google (Alphabet Inc.)	17.7%	-	15.2%	-	16.4%
Vicuna 7B	-	11.6%	-	11.6%	-	11.6%

19 / 19 models
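
The last value in each row appears to be the unweighted mean of whichever per-eval scores are reported, with "-" entries skipped. That is an inference from the numbers, not something the page states. The sketch below checks it against a few rows copied from the table. The helper names `aggregate` and `matches` and the 0.1-point tolerance (to allow for rounding in the displayed values) are assumptions made for illustration.

```python
from statistics import mean

# Rows copied from the table: (per-eval scores, listed final value).
# None stands for a "-" entry, i.e. no score reported for that eval.
rows = {
    "Qwen2.5 Coder 32B Instruct": ([92.1, 90.5, 87.2, 77.0], 86.7),
    "Grok Beta": ([88.4, 86.0, 80.5, 65.6], 80.1),
    "CodeLlama 70B Instruct": ([72.0, None, 65.9, None], 69.0),
    "GPT-4o-mini": ([None, 78.0, None, None], 78.0),
}


def aggregate(scores):
    """Assumed rule: unweighted mean over the scores that are present."""
    present = [s for s in scores if s is not None]
    return mean(present)


def matches(scores, listed, tol=0.1):
    """Compare the recomputed mean with the listed value, allowing for rounding."""
    return abs(aggregate(scores) - listed) <= tol


all_match = all(matches(scores, listed) for scores, listed in rows.values())
print(all_match)
```

Under this rule, a model with a single reported score, such as GPT-5 Nano at 100.0%, gets that score as its final value, so its rank is not directly comparable with that of models evaluated on all four evals.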