HumanEval

164 hand-written Python programming problems with unit tests, the original LLM code-generation benchmark from OpenAI's Codex paper.
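To make "problems with unit tests" concrete, the following is a minimal sketch of how one task of this kind can be checked. The task itself (`add`) is hypothetical and not taken from the benchmark. The field names `prompt`, `test`, and `entry_point`, and the helper `passes`, are assumptions chosen for illustration rather than a description of any official harness.

```python
# Hypothetical HumanEval-style task: a signature plus docstring (the prompt)
# and unit tests wrapped in a check() function. Not a real benchmark item.
task = {
    "entry_point": "add",
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}


def passes(task: dict, completion: str) -> bool:
    """Return True if prompt + completion passes the task's unit tests."""
    program = task["prompt"] + completion + "\n\n" + task["test"]
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code. Real model output should only
        # ever be executed inside a sandbox with time and resource limits.
        exec(program, namespace)
        namespace["check"](namespace[task["entry_point"]])
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False


good = "    return a + b\n"  # a correct function body
bad = "    return a - b\n"   # a wrong function body

print(passes(task, good), passes(task, bad))  # True False
```

The model only supplies the function body, so a completion is judged by behavior on the tests rather than by textual similarity to a reference solution.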

Publisher: OpenAI
Capabilities: Code Generation
Domain: code
Format: HF Dataset
Size: 164 tasks
License: MIT
Published: Jul 2021
Notable for: Benchmark for evaluating code generation in the code domain.


Leaderboard scores
EvalPlus

Top score: 92.1% by Qwen2.5 Coder 32B Instruct, with 17 models reporting (1 from frontier labs).
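As a rough reading aid, a percentage score can be translated into a task count. The sketch below assumes the reported score is simply the share of the 164 tasks solved in a single attempt per task. The page does not state the scoring protocol, so that assumption and the helper names are illustrative only.

```python
TOTAL_TASKS = 164  # benchmark size stated on this page


def solved_from_percent(percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of solved tasks implied by a percentage score."""
    return round(percent / 100 * total)


def score_percent(solved: int, total: int = TOTAL_TASKS) -> float:
    """Percentage score, to one decimal place, for a given number of solved tasks."""
    return round(100 * solved / total, 1)


print(solved_from_percent(92.1))  # 151
print(score_percent(151))         # 92.1
print(score_percent(1))           # 0.6
```

Under that assumption, 92.1% corresponds to 151 of 164 tasks, and each task is worth about 0.6 percentage points, so small gaps between models near the top reflect only a handful of problems.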

Score history

[Chart: top score over time. Y-axis: 10% to 100%. X-axis: Mar 23 to Nov 24. Labeled models: Vicuna 13B, Llama 3 Instruct 70B, Grok Beta, Qwen2.5 Coder 32B Instruct.]

Top models

[Bar chart: HumanEval scores for 17 models. Highest value: Qwen2.5 Coder 32B Instruct at 92.1.]

Where it's ranked (2)

Related tools (6)

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers (4)

Contributors (2)

FAQ

What is HumanEval?
HumanEval is a set of 164 hand-written Python programming problems with unit tests. It is the original LLM code-generation benchmark, introduced in OpenAI's Codex paper.
What capabilities does HumanEval test?
HumanEval evaluates code generation.
What is the current top score on HumanEval?
The top reported score is 92.1% by Qwen2.5 Coder 32B Instruct, across 17 models reporting (1 from frontier labs).
How can a model improve its HumanEval score?
Sophon links RL environments, datasets, and scaffolds that target this eval. The linked tools include Humaneval RL Env (Community), CODE Humaneval RL Env (Community), Openenv Coding RL Env (Meta FAIR (Fundamental AI Research)), and Humaneval Multiturn RL Env (Community).
What license is HumanEval under?
HumanEval is available under the MIT license.