Evaluating Large Language Models Trained on Code

OpenAI's foundational Codex paper introducing HumanEval, the pass@k metric, and the Codex model line that powered the original GitHub Copilot.

Publisher: OpenAI
Year: 2021
Venue: preprint
ArXiv: arxiv.org/abs/2107.03374
Code: github.com/openai/human-eval
Authors: 58
Hosting: External source, license unknown

Attribution

Abstract & full text: arxiv.org/abs/2107.03374
TL;DR: semanticscholar.org/paper/acbdbf49f9bc3f151b93d9ca9a06009f4f6eb269
Code: github.com/openai/human-eval

TL;DR

Semantic Scholar

The paper finds that repeated sampling from the GPT language model is a surprisingly effective strategy for producing working solutions to difficult prompts, and it discusses the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.