Evaluating Large Language Models Trained on Code (pass@k formulation)

The Codex paper, which introduced the HumanEval benchmark and an unbiased estimator for pass@k: the probability that at least one of k sampled programs passes a problem's unit tests. pass@k is now the standard scoring metric for code-generation evals.
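As a minimal sketch of the estimator described above: for each problem, draw n samples, count the c that pass the unit tests, and estimate pass@k as 1 - C(n-c, k) / C(n, k), then average over problems. The function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` input layout are assumptions chosen for illustration, not the API of the human-eval repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: samples that passed, k: sample budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made only of failing samples.
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs (hypothetical layout)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `pass_at_k(10, 3, 1)` is approximately 0.3, the plain pass fraction, while `pass_at_k(10, 3, 2)` is higher because either of two draws may succeed. Exact integer binomials are used here for clarity; a running-product form avoids very large intermediate values when n is big.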

Publisher: OpenAI
Year: 2021
Venue: preprint
ArXiv: arxiv.org/abs/2107.03374
Code: github.com/openai/human-eval
Authors: 4
Hosting: External source, license unknown

Attribution

Abstract & full text: arxiv.org/abs/2107.03374
TL;DR: semanticscholar.org/paper/acbdbf49f9bc3f151b93d9ca9a06009f4f6eb269
Code: github.com/openai/human-eval

Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

It is found that repeated sampling from the GPT language model is a surprisingly effective strategy for producing working solutions to difficult prompts, and the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics, are discussed.

Artifacts

Evals

HumanEval

Authors

Heewoo Jun, Jerzy "Jerry" Tworek, Mark Chen, Wojciech Zaremba