MMMU

Description

MMMU (Massive Multi-discipline Multimodal Understanding) is an environment for evaluating college-level multimodal reasoning across 6 disciplines, 30 subjects, and 183 subfields. Each question includes up to 7 heterogeneous images (charts, diagrams, tables, chemical structures, music notation, etc.) and requires understanding complex visual and textual information.

Capabilities

College-level multimodal question answering
Up to 7 images per question, drawn from 30 image types
Multiple-choice evaluation across expert-level reasoning tasks

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

Apache 2.0.

Tasks

This environment provides three splits:

dev: 150 tasks
validation: 900 tasks
test: 10,500 tasks

Total: 11,550 college-level questions spanning Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
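
The split sizes above can be sanity-checked with a few lines of Python. This is a minimal sketch that uses only the numbers stated in this README; the names `SPLIT_SIZES` and `NUM_SUBJECTS` are illustrative and not part of the environment's API. The per-subject figures are plain averages and do not imply that every subject has exactly that many questions.

```python
# Split sizes as listed above (illustrative constants, not an environment API).
SPLIT_SIZES = {"dev": 150, "validation": 900, "test": 10_500}
NUM_SUBJECTS = 30  # number of subjects, per the description

total = sum(SPLIT_SIZES.values())

# Share of all questions in each split, and the average count per subject.
share = {name: size / total for name, size in SPLIT_SIZES.items()}
avg_per_subject = {name: size / NUM_SUBJECTS for name, size in SPLIT_SIZES.items()}

for name, size in SPLIT_SIZES.items():
    print(f"{name:<10} {size:>6} tasks  {share[name]:6.1%}  ~{avg_per_subject[name]:.0f} per subject")
print(f"{'total':<10} {total:>6} tasks")
```

The test split holds roughly nine in ten questions, so scores on dev or validation are computed over a much smaller sample.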

Reward Structure

Single-turn evaluation with deterministic grading. The agent submits a single letter answer via the submit_answer tool. The submitted answer is compared via exact match against the ground truth. Reward is 1.0 if correct, 0.0 if incorrect.
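
The grading rule described above amounts to a strict string comparison. The sketch below is illustrative only: `grade` and `accuracy` are hypothetical helper names, not the environment's actual implementation, and it assumes no normalization (such as case folding or whitespace stripping) is applied before the comparison, since none is described here.

```python
from typing import Iterable, Tuple


def grade(submitted: str, ground_truth: str) -> float:
    """Return 1.0 if the submitted letter exactly matches the ground truth, else 0.0."""
    return 1.0 if submitted == ground_truth else 0.0


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Mean reward over (submitted, ground_truth) pairs; 0.0 for an empty input."""
    rewards = [grade(s, t) for s, t in pairs]
    return sum(rewards) / len(rewards) if rewards else 0.0


# Made-up answers for illustration; these are not real MMMU items.
examples = [("B", "B"), ("A", "C"), ("D", "D"), ("b", "B")]
print([grade(s, t) for s, t in examples])  # [1.0, 0.0, 1.0, 0.0]
print(accuracy(examples))                  # 0.5
```

Because each episode yields a binary reward, the mean reward over a split equals accuracy on that split.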

Data

Parquet files (~605 MB total) for the dev, validation, and test splits, sourced from the MMMU/MMMU dataset on Hugging Face and stored on the OpenReward platform.

Tools

Tool	Description
`submit_answer`	Submit a single letter answer. Deterministic evaluation via exact match. Ends the episode.

Time Horizon

Single-turn. The agent reads the multimodal question (text and images) and submits one answer.

Environment Difficulty

MMMU evaluates college-level multimodal understanding. Reported accuracy:

Model	Accuracy
Gemini 3 Flash	87.6%
Gemini 3 Pro	87.5%
GPT-5.2	86.7%
Claude 4.5 Sonnet	77.8%
Human Expert	88.6%

All four models listed exceed the lowest-scoring human experts (76.2%) but still trail the top human experts (88.6%).
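
To make the comparison concrete, the gaps can be computed directly from the figures in the table. This is a small sketch: the numbers are copied from this README, while the names `MODEL_ACCURACY`, `HUMAN_TOP`, and `HUMAN_REFERENCE` are illustrative only.

```python
# Accuracies (%) copied from the table above.
MODEL_ACCURACY = {
    "Gemini 3 Flash": 87.6,
    "Gemini 3 Pro": 87.5,
    "GPT-5.2": 86.7,
    "Claude 4.5 Sonnet": 77.8,
}
HUMAN_TOP = 88.6        # "Human Expert" row in the table
HUMAN_REFERENCE = 76.2  # lower human-expert figure quoted in the text

# Percentage-point gap between each model and the top human-expert score.
gap_to_top = {m: round(HUMAN_TOP - acc, 1) for m, acc in MODEL_ACCURACY.items()}

# Whether each model exceeds the lower human-expert figure.
above_reference = {m: acc > HUMAN_REFERENCE for m, acc in MODEL_ACCURACY.items()}

for model, acc in MODEL_ACCURACY.items():
    print(f"{model:<18} {acc:5.1f}%  gap to top: {gap_to_top[model]:4.1f} pts  "
          f"above {HUMAN_REFERENCE}%: {above_reference[model]}")
```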

Other Environment Requirements

There are no further environment requirements; MMMU works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in MMMU answer college-level multimodal questions in a standard environment. The environment does not present direct safety risks.

Citation

@article{yue2023mmmu,
  title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI},
  author={Yue, Xiang and Ni, Yuansheng and Zhang, Kai and Zheng, Tianyu and Liu, Ruoqi and Zhang, Ge and Stevens, Samuel and Jiang, Dongfu and Ren, Weiming and Sun, Yuxuan and others},
  journal={arXiv preprint arXiv:2311.16502},
  year={2023}
}