Measuring Massive Multitask Language Understanding
Introduces MMLU, a 57-subject multiple-choice exam that became the de facto general-knowledge benchmark for LLMs.
- Publisher: University of California, Berkeley
- Year: 2020
- Venue: ICLR
- Authors: 7
- Hosting: External source, license unknown
Introduces 1 artifact - 1 eval
TL;DR
Semantic Scholar
While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
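
To make the "percentage points over random chance" figure concrete, here is a minimal sketch of how such a margin could be computed. It assumes four answer options per question (so random chance is 25%) and an unweighted mean over tasks; the summary above does not say how the average is weighted. The function names and the toy data are hypothetical and are not results from the paper.

```python
from statistics import mean

NUM_CHOICES = 4                    # assumption: four answer options per question
RANDOM_CHANCE = 1.0 / NUM_CHOICES  # expected accuracy of uniform guessing


def accuracy(predictions, answers):
    """Return the fraction of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def points_over_chance(per_task):
    """Mean per-task accuracy minus random chance, in percentage points.

    `per_task` maps a task name to a (predictions, answers) pair.
    Every task counts equally (assumption: unweighted mean over tasks).
    """
    task_accuracies = [accuracy(p, a) for p, a in per_task.values()]
    return 100.0 * (mean(task_accuracies) - RANDOM_CHANCE)


# Hypothetical toy data for illustration only.
toy_results = {
    "task_a": (["A", "B", "C", "D"], ["A", "B", "C", "A"]),  # 3 of 4 correct
    "task_b": (["A", "A", "B", "B"], ["A", "C", "B", "D"]),  # 2 of 4 correct
}

margin = points_over_chance(toy_results)
print(f"{margin:.1f} percentage points over random chance")
```

Under these assumptions, a margin of almost 20 points corresponds to an average accuracy of just under 45%, which illustrates why the summary describes even the best model as far from expert level.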