What capabilities does Massive Multitask Language Understanding (MMLU) test?

Massive Multitask Language Understanding (MMLU) evaluates factual recall and scientific reasoning.

What is the current top score on Massive Multitask Language Understanding (MMLU)?

The top reported score is 83.1 by Tülu 3 70B, across 2 models reporting (1 from frontier labs).

How can a model improve its Massive Multitask Language Understanding (MMLU) score?

Tools linked to Massive Multitask Language Understanding (MMLU) on Sophon include MMLU RL Env (Prime Community), MMLU RL Env (Community), VF Openbench RL Env (Community), and Openmed Medknowledge RL Env (Community). These are RL environments, datasets, and scaffolds that target this eval.

What license is Massive Multitask Language Understanding (MMLU) under?

Massive Multitask Language Understanding (MMLU) is available under the MIT license.

Massive Multitask Language Understanding (MMLU)

57-subject multiple-choice exam testing broad world knowledge and reasoning across academic and professional domains.

Open

Publisher: University of California, Berkeley
Capabilities: Factual Recall, Scientific Reasoning
Format: HF Dataset
Size: 15,908 tasks
License: MIT
Published: Sep 2020
Notable for: Benchmark for evaluating factual recall and scientific reasoning.
Canonical: github.com/hendrycks/test
Also on: huggingface.co/datasets/cais/mmlu
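
Because the eval is a multiple-choice exam spanning many subjects, scoring usually comes down to comparing a predicted answer letter against the labeled one and reporting accuracy. The following is a minimal sketch of that idea, not the official harness. The field names (`question`, `subject`, `choices`, `answer` as a 0-based index), the prompt layout, and the toy items are assumptions made for illustration, and the items are not taken from the dataset.

```python
from collections import defaultdict

LETTERS = "ABCD"  # assumes four answer options per item


def format_prompt(question, choices):
    """Render one item as a lettered prompt (this layout is an assumption)."""
    lines = [question.strip()]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def score(items, predictions):
    """Return (overall accuracy, per-subject accuracy) for predicted letters."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["subject"]] += 1
        # The gold label is stored as an index into the choices list.
        if pred.strip().upper() == LETTERS[item["answer"]]:
            correct[item["subject"]] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_subject


# Hypothetical toy items for illustration only.
ITEMS = [
    {"subject": "astronomy", "question": "Which planet is closest to the Sun?",
     "choices": ["Venus", "Mercury", "Earth", "Mars"], "answer": 1},
    {"subject": "astronomy", "question": "How many planets are in the Solar System?",
     "choices": ["7", "8", "9", "10"], "answer": 1},
    {"subject": "elementary_mathematics", "question": "What is 7 * 6?",
     "choices": ["36", "40", "42", "48"], "answer": 2},
]

if __name__ == "__main__":
    print(format_prompt(ITEMS[0]["question"], ITEMS[0]["choices"]))
    overall, per_subject = score(ITEMS, ["B", "a", "C"])
    print(f"overall accuracy: {overall:.3f}")
    print(per_subject)
```

Reporting per-subject accuracy alongside the overall figure makes it easier to see which domains drive a score. How a given leaderboard aggregates subjects is not stated here, so treat the overall number above as one possible convention.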

Attribution

Leaderboard scores: Vaultprime-hub

Top score: 83.1 by Tülu 3 70B (2 models reporting, 1 frontier)

[Figure: score history and top-models bar chart for Massive Multitask Language Understanding (MMLU), 2 models. Highest value: Tülu 3 70B at 83.1.]

Where it's ranked

Open LLM Leaderboard

Hugging Face

Aggregated

aggregated with 6 others · live

Arena Overall

LMArena

Human preference

preference voting · live

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

MMLU RL Env (Prime Community)

Prime Community

MMLU evaluator for multi-subject multiple-choice reasoning.

Implementation · RL Env · General Knowledge · NLP

MMLU RL Env (Community)

MMLU evaluator for multi-subject multiple-choice reasoning.

Implementation · RL Env · General Knowledge · NLP

VF Openbench RL Env (Community)

Environment for single-turn tasks in OpenBench

Trains toward · RL Env

Openmed Medknowledge RL Env (Community)

Comprehensive medical knowledge MCQ environment using MMLU medical subsets.

Trains toward · RL Env

Capybara

LDJnr

LDJnr's multi-turn reasoning dataset built with the Amplify-Instruct synthesis method: short but deep conversations on a single topic.

Training data · SFT Dataset · Multi Turn Dialog · Instruction Following · Scientific Reasoning

OpenHermes 2.5

Teknium

Teknium's million-row aggregation of high-quality GPT-4-style synthetic instructions that became the de facto open SFT baseline of 2023-2024.

Training data · SFT Dataset · Instruction Following · Multi Turn Dialog · Code Generation

OpenOrca

OpenOrca Team

An open reproduction of Microsoft's Orca recipe: FLAN prompts with GPT-4 chain-of-thought completions that taught reasoning by imitation.

Training data · SFT Dataset · Instruction Following · Math · Scientific Reasoning

ShareGPT

Anonymous Community

The original community-scraped corpus of ChatGPT conversations that bootstrapped Vicuna and the entire open-instruction-tuning era.

Training data · SFT Dataset · Multi Turn Dialog · Instruction Following

SlimOrca

OpenOrca Team

A heavily-deduplicated, GPT-4-only slice of OpenOrca that delivers similar downstream quality at one-third the size.

Training data · SFT Dataset · Instruction Following · Math · Scientific Reasoning

Tülu 3 SFT Mixture

Allen Institute for AI (Ai2)

Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality instruct model.

Training data · SFT Dataset · Instruction Following · Math · Code Generation

Papers

Measuring Massive Multitask Language Understanding

ICLR · 2021

Introduces MMLU, a 57-subject multiple-choice exam that became the de facto general-knowledge benchmark for LLMs.

introduces

Contributors

Dan Hendrycks
