Evaluating Large Language Models Trained on Code

The Codex paper that introduces HumanEval (164 hand-written Python problems) and the pass@k metric, and presents the model behind GitHub Copilot.

Publisher: OpenAI
Year: 2021
Venue: preprint
ArXiv: arxiv.org/abs/2107.03374
Code: github.com/openai/human-eval
Authors: 6
Hosting: External source (license unknown)

Attribution

Abstract & full text: arxiv.org/abs/2107.03374
TL;DR: semanticscholar.org/paper/acbdbf49f9bc3f151b93d9ca9a06009f4f6eb269
Code: github.com/openai/human-eval

Introduces 1 artifact - 1 model

TL;DR

Semantic Scholar

The paper finds that repeated sampling from the GPT language model is a surprisingly effective strategy for producing working solutions to difficult prompts, and it discusses the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
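
The repeated-sampling result is measured with the pass@k metric: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct as 1 - C(n-c, k) / C(n, k), averaged over problems. The following is a minimal sketch of that estimator, not the reference implementation from the human-eval repository. The function names `pass_at_k` and `mean_pass_at_k` and the input layout are assumptions made for illustration, and it uses exact integer binomials from `math.comb` rather than a floating-point product form.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed the tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of failing samples;
    # it is 0 when fewer than k samples failed, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs (hypothetical layout)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must contain at least one problem")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative counts only, not results from the paper.
    example = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(example, k=1))
    print(mean_pass_at_k(example, k=5))
```

Computing the estimate from n > k samples, instead of drawing exactly k samples and checking whether any passes, lowers the variance of the estimate for the same problem set.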

Artifacts

Models

OpenAI Codex (2021)

Authors

Heewoo Jun, Henrique Ponde de Oliveira Pinto, Jerzy "Jerry" Tworek, Mark Chen, Qiming Yuan, Wojciech Zaremba