HumanEval+

Extended HumanEval with 80× more test cases (EvalPlus). Catches more edge-case bugs than the original.
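To illustrate why a larger test suite catches more edge-case bugs, here is a minimal sketch. The function `buggy_median`, the helper `passes`, and both test lists are hypothetical and written only for this example; they are not taken from HumanEval or EvalPlus. A flawed solution passes a small base suite but fails once more inputs are added.

```python
# Hypothetical illustration; none of this comes from the HumanEval or EvalPlus test suites.

def buggy_median(xs):
    """Candidate solution with an edge-case bug: wrong for even-length lists."""
    ordered = sorted(xs)
    return ordered[len(ordered) // 2]


def passes(candidate, tests):
    """Return True if candidate matches the expected output on every (args, expected) pair."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # Any crash counts as a failed test.
            return False
    return True


# A small base suite that only covers odd-length inputs.
base_tests = [
    (([3, 1, 2],), 2),
    (([5],), 5),
]

# An extended suite that adds even-length inputs, which exposes the bug.
extended_tests = base_tests + [
    (([1, 2, 3, 4],), 2.5),
    (([10, 20],), 15),
]

base_passed = passes(buggy_median, base_tests)
extended_passed = passes(buggy_median, extended_tests)
print("passes base suite:", base_passed)
print("passes extended suite:", extended_passed)
```

In this sketch the same solution would count as correct under the base suite and incorrect under the extended one, which is the kind of gap a larger test set is meant to expose.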

Publisher
EvalPlus Team
Published
Apr 2023

Attribution

Leaderboard scores
EvalPlus

Top score: 87.2% by Qwen2.5 Coder 32B Instruct, with 17 models reporting (1 from frontier labs).

Score history

[Chart: HumanEval+ score (10% to 100%) over time, Mar 23 to Nov 24. Labeled models: Vicuna 13B, Llama 3 Instruct 70B, Grok Beta, Qwen2.5 Coder 32B Instruct.]

Top models

[Bar chart: HumanEval+ scores for 17 models. Highest value: Qwen2.5 Coder 32B Instruct at 87.2.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is HumanEval+?
Extended HumanEval with 80× more test cases (EvalPlus). Catches more edge-case bugs than the original.
What is the current top score on HumanEval+?
The top reported score is 87.2% by Qwen2.5 Coder 32B Instruct, across 17 models reporting (1 from frontier labs).
How can a model improve its HumanEval+ score?
Tools linked to HumanEval+ on Sophon include Humanevalplus RL Env (Community). Linked tools are RL environments, datasets, and scaffolds that target this eval.