What capabilities does HumanEval test?

HumanEval evaluates code generation.

What is the current top score on HumanEval?

The top reported score is 92.1% by Qwen2.5 Coder 32B Instruct, across 17 models reporting (1 from frontier labs).

What license is HumanEval under?

HumanEval is available under MIT.

HumanEval

164 hand-written Python programming problems with unit tests, the original LLM code-generation benchmark from OpenAI's Codex paper.

Open

Publisher: OpenAI
Capabilities: Code Generation
Domain: code
Format: HF Dataset
Size: 164 tasks
License: MIT
Published: Jul 2021
Notable for: The original LLM code-generation benchmark, introduced with OpenAI's Codex paper.
Canonical: github.com/openai/human-eval
Also on: huggingface.co/datasets/openai_humaneval
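
To make "programming problems with unit tests" concrete, here is a minimal sketch of how one record can be checked. The field names follow the published dataset layout (task_id, prompt, canonical_solution, test, entry_point); the toy task and the `passes` helper are made up for illustration and are not part of the benchmark or its official harness.

```python
# A toy task in the HumanEval record layout. The task itself is hypothetical.
toy_task = {
    "task_id": "Toy/0",
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def passes(task: dict, completion: str) -> bool:
    """Run prompt + completion against the task's check() function."""
    program = task["prompt"] + completion + "\n" + task["test"]
    namespace: dict = {}
    try:
        # Model output is untrusted code: run it in a sandbox in real use.
        exec(program, namespace)
        namespace["check"](namespace[task["entry_point"]])
        return True
    except Exception:
        return False
```

A completion counts as correct only if every assertion in `check` holds; `passes(toy_task, toy_task["canonical_solution"])` exercises the reference solution.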

Attribution

Leaderboard scores: EvalPlus

Top score 92.1% by Qwen2.5 Coder 32B Instruct - 17 models reporting (1 frontier)

Top models

[Bar chart: HumanEval scores for 17 models. Highest value: Qwen2.5 Coder 32B Instruct at 92.1.]

Where it's ranked

Open LLM Leaderboard

Hugging Face

Aggregated

aggregated with 6 others · live

EvalPlus Leaderboard

EvalPlus Team

Aggregated

aggregated with 3 others · monthly

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Humaneval RL Env (Community)

HumanEval code generation evaluation environment

Implementation · RL Env · Code

CODE Humaneval RL Env (Community)

HumanEval-style code generation with unit-test pass rate as reward.

Implementation · RL Env · Code · Humaneval
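
One way to read "unit-test pass rate as reward" is as the fraction of test cases a candidate function gets right. The sketch below assumes that reading; the `pass_rate_reward` name and the (args, expected) case format are hypothetical and may not match how this environment actually scores.

```python
from typing import Any, Callable, Sequence


def pass_rate_reward(
    candidate: Callable[..., Any],
    cases: Sequence[tuple[tuple, Any]],
) -> float:
    """Return the fraction of (args, expected) cases the candidate gets right."""
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case counts as a failure
    return passed / len(cases)
```

A fractional reward gives partial credit for partly correct code, whereas the benchmark's own scoring treats a sample as passing only when all tests pass.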

Openenv Coding RL Env (Meta FAIR (Fundamental AI Research))

Meta FAIR (Fundamental AI Research)

Sandboxed Python code-execution environment built on smolagents, exposing stdout/stderr/exit_code via the OpenEnv HTTP interface for closed-loop code-solver training.

Trains toward · RL Env · Code Generation · Tool Calling · Code
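
The stdout/stderr/exit_code triple can be illustrated with the standard library alone. This is a rough sketch, not the OpenEnv HTTP interface or smolagents: `ExecResult` and `run_python` are hypothetical names, and a child process with a timeout is not a real sandbox.

```python
import subprocess
import sys
from typing import NamedTuple


class ExecResult(NamedTuple):
    stdout: str
    stderr: str
    exit_code: int


def run_python(code: str, timeout_s: float = 5.0) -> ExecResult:
    """Run code in a child interpreter and capture its output streams."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Assumed convention: report a timeout as exit code -1.
        return ExecResult(stdout="", stderr="timeout", exit_code=-1)
    return ExecResult(proc.stdout, proc.stderr, proc.returncode)
```

A training loop can then inspect `exit_code` to detect failure and pass `stderr` back to the solver as feedback.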

Humaneval Multiturn RL Env (Community)

Multi-turn HumanEval - failed tests feed back as hints.

Trains toward · RL Env · Code · Humaneval
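
The multi-turn idea, where failed tests come back as hints for the next attempt, might look like the loop below. This is only a sketch under assumptions: `policy` stands in for the model (any callable taking the prompt and the current hints and returning source code), and the hint wording and the (args, expected) case format are invented for illustration.

```python
from typing import Callable


def run_checks(source: str, entry_point: str, cases: list) -> list[str]:
    """Return one human-readable message per failing (args, expected) case."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # untrusted code: sandbox this in real use
        func = namespace[entry_point]
    except Exception as exc:
        return [f"code did not load: {exc!r}"]
    failures = []
    for args, expected in cases:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"{entry_point}{args} raised {exc!r}")
            continue
        if got != expected:
            failures.append(
                f"{entry_point}{args} returned {got!r}, expected {expected!r}"
            )
    return failures


def multiturn_episode(
    policy: Callable[[str, list[str]], str],
    prompt: str,
    entry_point: str,
    cases: list,
    max_turns: int = 3,
) -> tuple[bool, int]:
    """Return (solved, turns_used); each turn sees the previous failures."""
    hints: list[str] = []
    for turn in range(1, max_turns + 1):
        source = policy(prompt, hints)
        hints = run_checks(source, entry_point, cases)
        if not hints:
            return True, turn
    return False, max_turns
```

Returning the number of turns used lets a reward function prefer solutions found with fewer rounds of feedback.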

Humaneval Tools RL Env (Community)

Tool-use HumanEval - code runner + test executor primitives.

Trains toward · RL Env · Code · Humaneval · Tool Use

WizardLM Evol-Instruct

Microsoft

Microsoft's "Evol-Instruct" recipe - automatically rewriting simple instructions into harder, more diverse ones using an LLM evolver.

Training data · SFT Dataset · Instruction Following · Math · Code Generation

Papers

Evaluating Large Language Models Trained on Code (pass@k formulation)

preprint · 2021

The Codex paper's formulation of pass@k: the probability that at least one of k sampled completions passes the unit tests, computed with an unbiased estimator over n ≥ k samples per problem. It is now the standard scoring metric for code evals.
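
For one problem with n generated samples, c of which pass the tests, the unbiased estimate is pass@k = 1 - C(n-c, k) / C(n, k). The sketch below computes it with exact integer binomials from the standard library; the function name and argument checks are illustrative, and the paper's reference code uses an equivalent product form instead.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this per-problem value over all 164 tasks. With k = 1 it reduces to c / n, the plain fraction of passing samples.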

introduces

Evaluating Large Language Models Trained on Code

preprint · 2021

OpenAI's foundational Codex paper introducing HumanEval, the pass@k metric, and the Codex model line that powered the original GitHub Copilot.

Evaluating Large Language Models Trained on Code

preprint · 2021

The Codex paper that introduces HumanEval (164 hand-written Python problems) and the pass@k metric, and presents the model behind GitHub Copilot.

Contributors

Mark Chen, Jerzy "Jerry" Tworek

FAQ

What is HumanEval?: 164 hand-written Python programming problems with unit tests, the original LLM code-generation benchmark from OpenAI's Codex paper.
How can a model improve its HumanEval score?: Tools linked to HumanEval on Sophon include Humaneval RL Env (Community), CODE Humaneval RL Env (Community), Openenv Coding RL Env (Meta FAIR (Fundamental AI Research)), and Humaneval Multiturn RL Env (Community). These are RL environments, datasets, and scaffolds that target this eval.