MMMU

MMMU is a benchmark for evaluating multimodal models on massive multi-discipline tasks that require college-level subject knowledge and deliberate reasoning.

Type: RL Env
Runtime: ORS
License: unknown
Size: 10650 tasks
Published: Feb 2026

Description

MMMU (Massive Multi-discipline Multimodal Understanding) is an environment for evaluating college-level multimodal reasoning across 6 disciplines, 30 subjects, and 183 subfields. Each question includes up to 7 heterogeneous images (charts, diagrams, tables, chemical structures, music notation, etc.) and requires understanding complex visual and textual information.

Capabilities

  • College-level multimodal question answering
  • Up to 7 images per question with 30+ image types
  • Multiple-choice evaluation across expert-level reasoning tasks

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

Apache 2.0.

Tasks

Three splits in this environment:

  • dev: 150 tasks
  • validation: 900 tasks
  • test: 10,500 tasks

Total: 11,550 college-level questions spanning Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
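The total can be checked directly from the split sizes listed above. The following is a minimal sketch; the dictionary name `SPLIT_SIZES` is illustrative and not part of the environment.

```python
# Split sizes as listed in the Tasks section.
SPLIT_SIZES = {"dev": 150, "validation": 900, "test": 10_500}

total = sum(SPLIT_SIZES.values())

# Fraction of all tasks contributed by each split.
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

print(total)  # 11550
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# dev: 1.3%
# validation: 7.8%
# test: 90.9%
```

The test split accounts for roughly nine tenths of all tasks, so aggregate numbers are dominated by it.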

Reward Structure

Single-turn evaluation with deterministic grading. The agent submits a single-letter answer via the submit_answer tool, and the submission is compared by exact match against the ground truth. The reward is 1.0 if the answer is correct and 0.0 otherwise.
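As an illustration of the grading rule described above, here is a minimal sketch. The function names `grade` and `accuracy` are hypothetical, and the comparison is a strict string match; whether the actual environment normalizes case or whitespace before comparing is not stated in this card.

```python
from typing import Sequence


def grade(submitted: str, ground_truth: str) -> float:
    """Return 1.0 on an exact string match, 0.0 otherwise.

    Assumption: no normalization (case, whitespace) is applied.
    """
    return 1.0 if submitted == ground_truth else 0.0


def accuracy(rewards: Sequence[float]) -> float:
    """Mean reward over a batch of episodes; 0.0 for an empty batch."""
    return sum(rewards) / len(rewards) if rewards else 0.0


rewards = [grade("B", "B"), grade("A", "C"), grade("b", "B"), grade("D", "D")]
print(rewards)            # [1.0, 0.0, 0.0, 1.0]
print(accuracy(rewards))  # 0.5
```

Because each reward is binary, the mean reward over a set of tasks equals the accuracy on that set.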

Data

Parquet files (~605 MB total) for the dev, validation, and test splits, sourced from the Hugging Face dataset MMMU/MMMU and stored on the OpenReward platform.

Tools

Tool | Description
submit_answer | Submit a single letter answer. Deterministic evaluation via exact match. Ends the episode.

Time Horizon

Single-turn. The agent reads the multimodal question (text and images) and submits one answer.
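The single-turn flow can be sketched as an episode object that accepts exactly one submission. This is an illustrative model only: the class `SingleTurnEpisode` and its fields are hypothetical and do not reflect the OpenReward API; only the tool name `submit_answer` and the 1.0/0.0 reward come from this card.

```python
from dataclasses import dataclass


@dataclass
class SingleTurnEpisode:
    """Hypothetical model of one MMMU episode (not the OpenReward API)."""

    question: str
    images: list          # image payloads; the card states up to 7 per question
    ground_truth: str     # correct option letter
    done: bool = False
    reward: float = 0.0

    def submit_answer(self, letter: str) -> float:
        """Grade one answer by exact match and end the episode."""
        if self.done:
            raise RuntimeError("episode already ended; only one answer is allowed")
        self.reward = 1.0 if letter == self.ground_truth else 0.0
        self.done = True
        return self.reward


episode = SingleTurnEpisode(question="Which option is correct?", images=[], ground_truth="C")
print(episode.submit_answer("C"))  # 1.0
print(episode.done)                # True
```

Rejecting a second submission is a design choice made here to mirror the statement that submit_answer ends the episode.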

Environment Difficulty

MMMU evaluates college-level multimodal understanding:

Model | Accuracy
Gemini 3 Flash | 87.6%
Gemini 3 Pro | 87.5%
GPT-5.2 | 86.7%
Claude 4.5 Sonnet | 77.8%
Human Expert | 88.6%

All four models listed exceed the 76.2% reported for average human experts, but even the best of them (Gemini 3 Flash, 87.6%) still trails top human experts (88.6%).
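The margins behind this comparison can be computed from the figures above. The sketch below uses only the accuracies listed in this section; the variable names are illustrative.

```python
# Accuracies (percent) as listed in this section.
MODEL_SCORES = {
    "Gemini 3 Flash": 87.6,
    "Gemini 3 Pro": 87.5,
    "GPT-5.2": 86.7,
    "Claude 4.5 Sonnet": 77.8,
}
AVERAGE_HUMAN = 76.2  # average human expert, as stated above
TOP_HUMAN = 88.6      # top human expert, as stated above

# Gap to each human reference, in percentage points (rounded to 1 decimal).
gaps = {
    model: (round(score - AVERAGE_HUMAN, 1), round(score - TOP_HUMAN, 1))
    for model, score in MODEL_SCORES.items()
}

for model, (vs_average, vs_top) in gaps.items():
    print(f"{model}: {vs_average:+.1f} vs average, {vs_top:+.1f} vs top")
# Gemini 3 Flash: +11.4 vs average, -1.0 vs top
# Gemini 3 Pro: +11.3 vs average, -1.1 vs top
# GPT-5.2: +10.5 vs average, -1.9 vs top
# Claude 4.5 Sonnet: +1.6 vs average, -10.8 vs top
```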

Other Environment Requirements

There are no further environment requirements; MMMU works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in MMMU answer college-level multimodal questions in a standard environment. The environment does not present direct safety risks.

Citation

@article{yue2023mmmu,
  title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI},
  author={Yue, Xiang and Ni, Yuansheng and Zhang, Kai and Zheng, Tianyu and Liu, Ruoqi and Zhang, Ge and Stevens, Samuel and Jiang, Dongfu and Ren, Weiming and Sun, Yuxuan and others},
  journal={arXiv preprint arXiv:2311.16502},
  year={2023}
}