MMLU ProX

MMLU-ProX is a comprehensive benchmark for assessing cross-linguistic reasoning in LLMs across 29 languages, built on an English benchmark with each language version containing 11,829 identical questions (and a lite version of 658 questions per language) to enable direct compa…

Type: RL Env
Runtime: ORS
License: unknown
Size: 343041 tasks
Published: Feb 2026

MMLU-ProX

Description

MMLU-ProX is an environment for evaluating agents on multilingual multiple-choice question answering. It is based on the MMLU-ProX dataset hosted on Hugging Face (li-lab/MMLU-ProX), which extends MMLU-Pro to 29 languages. Each task presents a question with 10 answer options (A through J), drawn from 14+ subject categories. Grading is deterministic: the submitted letter is compared to the correct letter by exact match.

Capabilities

  • Multilingual multiple-choice question answering across 29 languages
  • Knowledge reasoning across 14+ subject categories (mathematics, science, health, business, humanities, etc.)
  • Single-turn evaluation with deterministic grading

Compute Requirements

MMLU-ProX extends Environment directly and does not require a sandbox. It has minimal compute requirements.

License

MIT.

Tasks

There are 58 splits (29 languages x 2 split types) in the format {language}_{split}:

  • Validation: 70 examples per language (2,030 total)
  • Test: 11,759 examples per language on average (341,011 total)
  • Total: 343,041 examples

Languages: af, ar, bn, cs, de, en, es, fr, hi, hu, id, it, ja, ko, mr, ne, pt, ru, sr, sw, te, th, uk, ur, vi, wo, yo, zh, zu.
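The split naming and the counts above can be checked with a short sketch. The language codes and totals are taken from this section; the suffix spellings "validation" and "test" are assumptions, since the text only gives the pattern {language}_{split}.

```python
# Language codes copied from the list above.
LANGUAGES = (
    "af ar bn cs de en es fr hi hu id it ja ko mr ne pt ru sr sw "
    "te th uk ur vi wo yo zh zu"
).split()

# Assumed suffix spellings; the text only gives the pattern {language}_{split}.
SPLIT_TYPES = ["validation", "test"]


def split_names():
    """Return every {language}_{split} name, grouped by language."""
    return [f"{lang}_{kind}" for lang in LANGUAGES for kind in SPLIT_TYPES]


names = split_names()

# Counts stated in this section.
VALIDATION_PER_LANGUAGE = 70
TOTAL_EXAMPLES = 343_041

validation_total = VALIDATION_PER_LANGUAGE * len(LANGUAGES)
test_total = TOTAL_EXAMPLES - validation_total

print(len(names))          # number of splits
print(validation_total)    # validation examples across all languages
print(test_total)          # test examples across all languages
```

Under these assumptions, the first name produced is af_validation and the last is zu_test.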

Questions span 14+ subject areas including mathematics, science, health, business, humanities, computer science, law, and more.

Reward Structure

This is a sparse, verifiable reward environment with binary scoring. The agent calls submit_answer once with a letter (A-J). The answer is compared via exact match against the correct answer:

  • Correct: Reward 1.0.
  • Incorrect: Reward 0.0.

We do not use LLM graders for this task.
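As a minimal sketch of the binary scoring described above, the reward can be written as a strict string comparison. The function name score_answer is hypothetical, and whether the real environment normalizes case or whitespace before comparing is not stated here, so this sketch does no normalization.

```python
VALID_LETTERS = "ABCDEFGHIJ"  # the ten answer options, A through J


def score_answer(submitted: str, correct: str) -> float:
    """Return 1.0 on an exact match with the correct letter, else 0.0.

    Hypothetical helper: no case folding or whitespace stripping is applied,
    because the source only says "exact match".
    """
    if correct not in VALID_LETTERS or len(correct) != 1:
        raise ValueError(f"correct answer must be one of {VALID_LETTERS}")
    return 1.0 if submitted == correct else 0.0


print(score_answer("C", "C"))  # matching letter
print(score_answer("D", "C"))  # different letter
```

Because the comparison is a plain equality check, the same submission always yields the same reward, which is what makes the grading deterministic.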

Data

Questions are sourced from the li-lab/MMLU-ProX dataset on Hugging Face and consolidated into a single Parquet file. This allows efficient loading via predicate pushdown, where only the rows matching a filter are read. Data files are stored on the OpenReward platform.

Tools

Agents are given a single tool:

  • submit_answer: Submit an answer letter (A through J) for the current question. Returns whether the answer is correct. This tool can only be called once per task.
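The once-per-task constraint can be illustrated with a small stand-in class. This is a sketch, not the environment's implementation: the class name SingleAnswerTask, the error types, and the choice to reject invalid letters without consuming the call are all assumptions.

```python
class SingleAnswerTask:
    """Hypothetical stand-in for one task exposing a one-shot submit_answer."""

    VALID_LETTERS = "ABCDEFGHIJ"

    def __init__(self, correct_letter: str):
        self.correct_letter = correct_letter
        self.submitted = False

    def submit_answer(self, letter: str) -> bool:
        """Record the single allowed submission and report whether it is correct."""
        if self.submitted:
            # Assumed behavior: a second call is refused outright.
            raise RuntimeError("submit_answer can only be called once per task")
        if len(letter) != 1 or letter not in self.VALID_LETTERS:
            # Assumed behavior: invalid input does not use up the call.
            raise ValueError("answer must be a single letter from A through J")
        self.submitted = True
        return letter == self.correct_letter


task = SingleAnswerTask(correct_letter="B")
print(task.submit_answer("B"))  # first call is accepted
try:
    task.submit_answer("A")     # second call is rejected
except RuntimeError as err:
    print(err)
```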

Time Horizon

MMLU-ProX is a single-turn environment. The agent receives a question with 10 options and submits one answer. Each task requires exactly one tool call.

Environment Difficulty

Model performance on MMLU-ProX from the original paper (5-shot CoT):

Model           English   Swahili
QwQ-32B         70.7%     32.8%
Qwen2.5-72B     70.3%     40.1%
Llama3.1-405B   68.8%     52.1%

Performance degrades substantially from high-resource to low-resource languages: for the models in the table above, the gap between English and Swahili ranges from 16.7 to 37.9 percentage points.
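The English-Swahili gap for each model can be computed directly from the reported scores. The sketch below uses only the three rows of the table in this section; the variable names are illustrative.

```python
# (English, Swahili) accuracy in percent, copied from the table in this section.
SCORES = {
    "QwQ-32B": (70.7, 32.8),
    "Qwen2.5-72B": (70.3, 40.1),
    "Llama3.1-405B": (68.8, 52.1),
}

# Gap in percentage points, rounded to one decimal to avoid float noise.
gaps = {model: round(en - sw, 1) for model, (en, sw) in SCORES.items()}

for model, gap in gaps.items():
    print(f"{model}: {gap:.1f} percentage points")

largest = max(gaps, key=gaps.get)
smallest = min(gaps, key=gaps.get)
print(f"largest gap: {largest}, smallest gap: {smallest}")
```

In this table, the model with the highest English score has the largest gap, and the model with the lowest English score has the smallest.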

Other Environment Requirements

There are no further environment requirements; MMLU-ProX works out of the box with the OpenReward endpoint without any secrets.

Safety

Agents in MMLU-ProX are asked to answer multiple-choice knowledge questions. The environment does not present direct safety risks: agents only submit letter answers through the submit_answer tool and have no access to external systems, other tools, or the internet.

Citation

@inproceedings{xuan2025mmluprox,
  title={MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation},
  author={Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and others},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025},
  url={https://arxiv.org/abs/2503.10497}
}