EvalPlus Leaderboard

The EvalPlus team's joint HumanEval+ / MBPP+ leaderboard, which ranks LLMs on code-generation benchmarks expanded with 35-80× more test cases than the original HumanEval and MBPP.

Kind: Aggregated
Updates: monthly · updated 3d ago
Notable for: The reference leaderboard for checking whether a HumanEval or MBPP score survives the expanded "+" test suites; the canonical correction for inflated code-generation numbers from the original benchmarks.
Tracks: 4 evals · aggregated


Per-eval breakdown

19 models

| Model | Organization | Eval 1 | Eval 2 | Eval 3 | Eval 4 | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5 Nano | OpenAI | - | 100.0% | - | - | 100.0% |
| Qwen2.5 Coder 32B Instruct | Alibaba | 92.1% | 90.5% | 87.2% | 77.0% | 86.7% |
| Gemini 1.5 Pro 002 | Google (Alphabet Inc.) | 89.0% | 89.7% | 79.3% | 74.6% | 83.2% |
| Grok Beta | xAI | 88.4% | 86.0% | 80.5% | 65.6% | 80.1% |
| GPT-4o-mini | OpenAI | - | 78.0% | - | - | 78.0% |
| Gemini 1.5 Flash 002 | Google (Alphabet Inc.) | 82.3% | 84.7% | 75.6% | 67.5% | 77.5% |
| Llama 3 Instruct 70B | Meta Platforms | 77.4% | 82.3% | 72.0% | 69.0% | 75.2% |
| CodeLlama 70B Instruct | Meta Platforms | 72.0% | - | 65.9% | - | 69.0% |
| Qwen1.5 72B Chat | Alibaba | 68.3% | 72.5% | 59.1% | 61.6% | 65.4% |
| Command R | Cohere | 64.0% | 74.3% | 56.7% | 63.5% | 64.6% |
| Llama 3.1 8B Instruct | Meta Platforms | 69.5% | 68.3% | 62.8% | 55.6% | 64.0% |
| Phi 3 Mini 4k Instruct | Microsoft | 64.6% | 65.9% | 59.1% | 54.2% | 60.9% |
| Llama 3 8B Instruct | Meta Platforms | 61.6% | 64.6% | 56.7% | 54.8% | 59.4% |
| Gemma 1.1 7B It | Google (Alphabet Inc.) | 42.7% | 57.1% | 35.4% | 45.0% | 45.1% |
| Gemma 7B It | Google (Alphabet Inc.) | 28.7% | 47.1% | 25.0% | 36.8% | 34.4% |
| Gemma 1.1 2B It | Google (Alphabet Inc.) | 22.6% | 29.8% | 17.7% | 23.3% | 23.4% |
| Vicuna 13B | LMSYS Org | 17.1% | - | 15.9% | - | 16.5% |
| Gemma 2B It | Google (Alphabet Inc.) | 17.7% | - | 15.2% | - | 16.4% |
| Vicuna 7B | | 11.6% | - | 11.6% | - | 11.6% |
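
The page does not say how the final figure in each row is derived. The numbers are consistent with a plain mean over whichever of the four eval scores are present, with "-" entries skipped. The sketch below illustrates that assumption only; the function name `aggregate` is hypothetical and this is not the leaderboard's own code.

```python
from typing import Optional, Sequence


def aggregate(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the scores that are present; None marks a '-' cell.

    Assumption: the overall figure is an unweighted mean over
    available evals. The leaderboard does not document this.
    """
    present = [s for s in scores if s is not None]
    if not present:
        return None  # no evals reported, so nothing to aggregate
    return sum(present) / len(present)


# Per-eval scores copied from three rows of the listing above.
rows = {
    "Qwen2.5 Coder 32B Instruct": [92.1, 90.5, 87.2, 77.0],
    "Vicuna 13B": [17.1, None, 15.9, None],
    "GPT-4o-mini": [None, 78.0, None, None],
}

for name, scores in rows.items():
    overall = aggregate(scores)
    print(f"{name}: {overall:.1f}%")
```

One consequence of averaging only the available evals is that a model with a single reported score, such as GPT-5 Nano, can rank above models evaluated on all four, so rows with missing cells are not directly comparable to complete ones.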

Evals tracked: 4