MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

TIGER-Lab benchmark that upgrades MMLU with harder reasoning-heavy questions, 10 answer choices, and de-noised options for a higher ceiling.

Publisher: TIGER-Lab
Year: 2024
Venue: NeurIPS
ArXiv: arxiv.org/abs/2406.01574
Code: github.com/TIGER-AI-Lab/MMLU-Pro
Authors: 17
Hosting: External source, license unknown

Attribution

Abstract & full text: arxiv.org/abs/2406.01574
TL;DR: semanticscholar.org/paper/1406bb4cb6801bc4767b661308118c888a9b09da
Code: github.com/TIGER-AI-Lab/MMLU-Pro

Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

MMLU-Pro is an enhanced dataset that extends the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options.
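
One practical consequence of moving from four to ten options is a lower random-guess floor. The sketch below works through that arithmetic. It assumes every question has exactly the stated number of options and a single correct answer, which may not hold for every item in the dataset. The helper name chance_accuracy is hypothetical and is not part of the MMLU-Pro code.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a single-answer question.

    Illustrative helper only (assumption: exactly one correct option per question).
    """
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


# Four options (MMLU-style) versus ten options (MMLU-Pro-style).
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"4 options:  {float(mmlu_chance):.0%}")
print(f"10 options: {float(mmlu_pro_chance):.0%}")
```

Under these assumptions the guessing baseline falls from 1/4 to 1/10, so a correct answer is less likely to come from chance alone.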

Artifacts

Evals

MMLU-Pro

Authors

Aaran Arulraj, Abhranil Chandra, Alex Zhuang, Ge Zhang, Kai Wang, Max Ku, Rongqi Fan, Shiguang Guo, Tianle Li, Weiming Ren, Wenhu Chen, Xiang Yue, Xuan He, Xueguang Ma, Yuansheng Ni, Yubo Wang, Ziyan Jiang