MMLU-ProX

Description

MMLU-ProX is an environment for evaluating agents on multilingual multiple-choice question answering. It is based on the MMLU-ProX dataset from HuggingFace (li-lab/MMLU-ProX), which extends MMLU-Pro to 29 languages. Each task presents a question with 10 answer options (A through J) across 14+ subject categories. Grading is deterministic via exact match.

Capabilities

Multilingual multiple-choice question answering across 29 languages
Knowledge reasoning across 14+ subject categories (mathematics, science, health, business, humanities, etc.)
Single-turn evaluation with deterministic grading

Compute Requirements

MMLU-ProX extends Environment directly and does not require a sandbox. It has minimal compute requirements.

License

MIT.

Tasks

There are 58 splits (29 languages x 2 split types) in the format {language}_{split}:

Validation: 70 examples per language (2,030 total)
Test: 11,759 examples per language (341,011 total)
Total: 343,041 examples

Languages: af, ar, bn, cs, de, en, es, fr, hi, hu, id, it, ja, ko, mr, ne, pt, ru, sr, sw, te, th, uk, ur, vi, wo, yo, zh, zu.
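The split names and totals above can be checked with a few lines of Python. This is only a sketch: the lowercase split-type strings `validation` and `test` are assumed spellings, and the per-language test count is derived from the stated total on the assumption that every language has the same number of questions.

```python
# Language codes exactly as listed above.
LANGUAGES = (
    "af ar bn cs de en es fr hi hu id it ja ko mr ne pt ru sr sw "
    "te th uk ur vi wo yo zh zu"
).split()

# Assumed spellings of the two split types.
SPLIT_TYPES = ("validation", "test")


def split_names():
    """Return every split name in the {language}_{split} format."""
    return [f"{lang}_{split}" for lang in LANGUAGES for split in SPLIT_TYPES]


VALIDATION_PER_LANGUAGE = 70
TEST_TOTAL = 341_011

validation_total = VALIDATION_PER_LANGUAGE * len(LANGUAGES)
# Assumes the test questions are divided evenly across languages.
test_per_language, remainder = divmod(TEST_TOTAL, len(LANGUAGES))
grand_total = validation_total + TEST_TOTAL

if __name__ == "__main__":
    print(len(split_names()), validation_total, test_per_language, grand_total)
```

The remainder of the division is zero, so the stated test total is consistent with an equal number of test questions in each language.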

Questions span 14+ subject areas including mathematics, science, health, business, humanities, computer science, law, and more.

Reward Structure

This is a sparse, verifiable reward environment with binary scoring. The agent calls submit_answer once with a letter (A-J). The answer is compared via exact match against the correct answer:

Correct: Reward 1.0.
Incorrect: Reward 0.0.

We do not use LLM graders for this task.
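The binary exact-match rule described above might look like the following. This is an illustrative sketch, not the environment's actual grader: the function name `grade` is hypothetical, and the choice to apply no case or whitespace normalisation before comparing is an assumption.

```python
VALID_LETTERS = tuple("ABCDEFGHIJ")


def grade(submitted: str, correct: str) -> float:
    """Return 1.0 if the submitted letter exactly matches the correct one, else 0.0."""
    if correct not in VALID_LETTERS:
        raise ValueError(f"correct answer must be one of A-J, got {correct!r}")
    # Assumption: no normalisation, so "c" or " C" does not match "C".
    return 1.0 if submitted == correct else 0.0
```

Because the comparison is a plain string equality, the same submission always receives the same reward.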

Data

Questions are sourced from the li-lab/MMLU-ProX HuggingFace dataset, consolidated into a single parquet file for efficient loading via predicate pushdown. Data files are stored on the OpenReward platform.
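As a rough illustration of what predicate pushdown buys here, the sketch below reads only the rows for one language and split from a parquet file using `pyarrow`. The column names `language` and `split`, the helper names, and the file path are all hypothetical; the real schema and loader are not described above.

```python
def build_filters(language: str, split: str):
    """Build a pyarrow-style filter list selecting one language/split pair."""
    # Hypothetical column names; the actual parquet schema may differ.
    return [("language", "==", language), ("split", "==", split)]


def load_split(parquet_path: str, language: str, split: str):
    """Read only the matching rows from the consolidated parquet file."""
    # Imported lazily so build_filters can be used without pyarrow installed.
    import pyarrow.parquet as pq

    # The filters are handed to the parquet reader, which can skip row groups
    # whose statistics rule out a match instead of loading the whole file.
    return pq.read_table(parquet_path, filters=build_filters(language, split))
```

A call such as `load_split("mmlu_prox.parquet", "sw", "validation")` (with a placeholder path) would then return a table containing only the Swahili validation rows.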

Tools

Agents are given a single tool:

submit_answer: Submit an answer letter (A through J) for the current question. Returns whether the answer is correct. This tool can only be called once per task.
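The once-per-task constraint can be pictured as a small piece of state attached to the tool. The class below is a hypothetical sketch, not the OpenReward tool interface; the class name, the dictionary return shape, and the use of `RuntimeError` for a repeated call are assumptions made for illustration.

```python
class SubmitAnswerTool:
    """Toy model of a tool that accepts exactly one answer per task."""

    def __init__(self, correct_letter: str):
        self.correct_letter = correct_letter
        self.used = False

    def submit_answer(self, letter: str) -> dict:
        # A second call is rejected; how the real environment reports this
        # is not specified, so raising here is an assumption.
        if self.used:
            raise RuntimeError("submit_answer can only be called once per task")
        self.used = True
        return {"correct": letter == self.correct_letter}
```

Marking the tool as used before returning means that even an incorrect first submission consumes the single attempt.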

Time Horizon

MMLU-ProX is a single-turn environment. The agent receives a question with 10 options and submits one answer. Each task requires exactly one tool call.

Environment Difficulty

Model accuracy on MMLU-ProX as reported in the original paper (5-shot CoT):

Model	English	Swahili
QwQ-32B	70.7%	32.8%
Qwen2.5-72B	70.3%	40.1%
Llama3.1-405B	68.8%	52.1%

Performance degrades substantially from high-resource to low-resource languages: the English-to-Swahili gap in the table above ranges from 16.7 percentage points (Llama3.1-405B) to 37.9 percentage points (QwQ-32B).
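The English-to-Swahili gap for each model follows directly from the table. A short sketch of that arithmetic, using only the figures listed above (the variable and function names are illustrative):

```python
# (English accuracy, Swahili accuracy) in percent, copied from the table.
ACCURACY = {
    "QwQ-32B": (70.7, 32.8),
    "Qwen2.5-72B": (70.3, 40.1),
    "Llama3.1-405B": (68.8, 52.1),
}


def gaps(scores):
    """Return the English minus Swahili gap in percentage points per model."""
    return {model: round(en - sw, 1) for model, (en, sw) in scores.items()}


GAPS = gaps(ACCURACY)

if __name__ == "__main__":
    for model, gap in GAPS.items():
        print(f"{model}: {gap} points")
```

Note that these are differences in percentage points, not relative percentages; the model with the highest English score here also shows the largest drop.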

Other Environment Requirements

There are no further environment requirements; MMLU-ProX works out of the box with the OpenReward endpoint without any secrets.

Safety

Agents in MMLU-ProX are asked to answer multiple-choice knowledge questions. The environment does not present direct safety risks: agents only submit a letter answer through submit_answer and have no access to external systems, other tools, or the internet.

Citation

@inproceedings{xuan2025mmluprox,
  title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
  author={Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and others},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025},
  url={https://arxiv.org/abs/2503.10497}
}