HumanEval
164 hand-written Python programming problems with unit tests, the original LLM code-generation benchmark from OpenAI's Codex paper.
- Publisher: OpenAI
- Capabilities: Code Generation
- Domain: code
- Format: HF Dataset
- Size: 164 tasks
- License: MIT
- Published: Jul 2021
- Notable for: Benchmark for evaluating code generation.
- Canonical: github.com/openai/human-eval
Top score: 92.1% by Qwen2.5 Coder 32B Instruct; 17 models reporting (1 from frontier labs).
FAQ
- What is HumanEval?
- HumanEval is a set of 164 hand-written Python programming problems with unit tests. It is the original LLM code-generation benchmark, introduced in OpenAI's Codex paper.
- What capabilities does HumanEval test?
- HumanEval evaluates code generation.
- What is the current top score on HumanEval?
- The top reported score is 92.1% by Qwen2.5 Coder 32B Instruct, across 17 models reporting (1 from frontier labs).
- How can a model improve its HumanEval score?
- Sophon links RL environments, datasets, and scaffolds that target this eval. They include Humaneval RL Env (Community), CODE Humaneval RL Env (Community), Openenv Coding RL Env (Meta FAIR, Fundamental AI Research), and Humaneval Multiturn RL Env (Community).
- What license is HumanEval under?
- HumanEval is released under the MIT License.
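
To make "programming problems with unit tests" concrete, here is a minimal sketch of how a single HumanEval-style task could be checked. The record keys (`task_id`, `prompt`, `entry_point`, `test`) mirror the dataset's layout, but the task itself, the `passes` and `score` helpers, and the sample completions are hypothetical and written only for illustration. The sketch assumes the score is the percentage of tasks whose generated code passes its tests. It runs code with plain `exec` and has no sandbox or timeout, so it is suitable for trusted code only.

```python
# Toy task in a HumanEval-style record layout. The task is made up for
# illustration; it is not one of the 164 benchmark problems.
TOY_TASK = {
    "task_id": "Toy/0",
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}


def passes(task, completion):
    """Return True if prompt + completion passes the task's unit tests."""
    # Assemble one program: function signature, model-written body,
    # the test harness, and a call that runs the tests.
    program = (
        task["prompt"] + completion + "\n\n" + task["test"] + "\n"
        + f"check({task['entry_point']})\n"
    )
    try:
        exec(program, {})  # assumption: trusted code; no sandbox or timeout
        return True
    except Exception:
        # Failed assertions, runtime errors, and syntax errors all count
        # as a failed task.
        return False


def score(results):
    """Percentage of tasks solved, given one boolean per task."""
    return 100.0 * sum(results) / len(results)


good = "    return a + b\n"  # hypothetical correct completion
bad = "    return a - b\n"   # hypothetical incorrect completion
results = [passes(TOY_TASK, good), passes(TOY_TASK, bad)]
print(score(results))
```

A completion counts as correct only if every assertion in the task's tests holds, so output that looks plausible but computes the wrong value scores zero for that task.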
