Measuring Massive Multitask Language Understanding

Introduces MMLU, a 57-subject multiple-choice exam that became the de facto general-knowledge benchmark for LLMs.

Year
2020
Venue
ICLR
Authors
7
Hosting
External source (license unknown)


Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

Most recent models score near random chance, while the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
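
To make the "percentage points over random chance" comparison concrete, here is a minimal Python sketch. It assumes four answer options per question, so random guessing scores 25%; that figure is not stated on this page. The function names and toy data are hypothetical and are not taken from the paper or its code.

```python
# Assumption: four answer options per question, so random guessing scores 25%.
RANDOM_CHANCE = 1 / 4


def accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the answer key."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def gain_over_chance(acc, chance=RANDOM_CHANCE):
    """Improvement over random guessing, in percentage points."""
    return (acc - chance) * 100


# Hypothetical toy data for illustration only, not results from the paper.
predictions = ["A", "C", "B", "D", "A", "B", "C", "D"]
answers = ["A", "C", "D", "D", "B", "B", "A", "C"]

acc = accuracy(predictions, answers)
print(f"accuracy: {acc:.1%}, gain over chance: {gain_over_chance(acc):.1f} points")
```

Under this assumption, a gain of almost 20 percentage points corresponds to an average accuracy of roughly 45%.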
