Measuring Massive Multitask Language Understanding

Introduces MMLU, a 57-subject multiple-choice exam that became the de facto general-knowledge benchmark for LLMs.

Publisher: University of California, Berkeley
Year: 2020
Venue: ICLR
ArXiv: arxiv.org/abs/2009.03300
Code: github.com/hendrycks/test
Authors: 7
Hosting: External source; license unknown

Attribution

Abstract & full text: arxiv.org/abs/2009.03300
TL;DR: semanticscholar.org/paper/814a4f680b9ba6baba23b93499f4b48af1a27678
Code: github.com/hendrycks/test

Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.

Artifacts

Evals

Massive Multitask Language Understanding (MMLU)

Authors

Andy Zou, Collin Burns, Dan Hendrycks, Dawn Song, Jacob Steinhardt, Mantas Mazeika, Steven Basart