EMMA

Description

EMMA (Enhanced MultiModal reAsoning) is an environment for evaluating expert-level multimodal reasoning across mathematics, physics, chemistry, and coding. Tasks require integrated visual and textual reasoning and cannot be solved by reasoning over each modality independently. Questions are multiple-choice with embedded images.

Capabilities

- Multimodal reasoning across text and images
- Expert-level problem solving in chemistry, coding, math, and physics
- Interpreting diagrams, equations, charts, and scientific figures

Compute Requirements

This is a single-turn environment with no sandbox.

Tasks

There is one split in this environment:

Test: 2,788 tasks across four subjects:
- Chemistry: 1,176
- Coding: 564
- Math: 892
- Physics: 156

Each task presents a multiple-choice question (A–E) with one or more embedded images. The agent must select the correct answer letter.
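
Because the four subjects differ widely in size, an overall score over the Test split is dominated by the larger subjects. The sketch below shows how per-subject accuracies would combine into a count-weighted overall accuracy using the task counts listed above. The function name `overall_accuracy` is hypothetical, and whether the environment itself reports this kind of aggregate is an assumption.

```python
from typing import Dict

# Task counts per subject, taken from the Test split listed above.
SUBJECT_COUNTS: Dict[str, int] = {
    "chemistry": 1176,
    "coding": 564,
    "math": 892,
    "physics": 156,
}


def overall_accuracy(per_subject_accuracy: Dict[str, float]) -> float:
    """Combine per-subject accuracies (0.0 to 1.0) into a count-weighted overall accuracy.

    Hypothetical helper: each subject contributes in proportion to its task count.
    """
    total = sum(SUBJECT_COUNTS.values())
    correct = sum(
        SUBJECT_COUNTS[subject] * per_subject_accuracy[subject]
        for subject in SUBJECT_COUNTS
    )
    return correct / total
```

Under this weighting, chemistry alone accounts for 1,176 of the 2,788 tasks, so a change in chemistry accuracy moves the overall number far more than the same change in physics.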

Reward Structure

This is a single-turn environment with binary reward:

1.0 — Correct answer letter selected
0.0 — Incorrect answer

Evaluation is deterministic exact match on the answer letter (case-insensitive). No LLM grading is used. The environment accepts flexible input formats (e.g., "A", "A)", "A.").
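
The grading rule described above can be sketched as follows. This is a minimal illustration, not the environment's actual implementation: the function names `normalize_answer` and `score` are hypothetical, and the exact set of accepted formats beyond "A", "A)", and "A." is an assumption.

```python
import re
from typing import Optional


def normalize_answer(raw: str) -> Optional[str]:
    """Extract a single answer letter (A-E) from inputs such as 'a', 'A)', or 'A.'.

    Returns the uppercase letter, or None if the input is not one lone letter
    with optional surrounding punctuation. The accepted punctuation is an assumption.
    """
    match = re.fullmatch(r"\s*\(?([A-Ea-e])\s*[).:]?\s*", raw)
    return match.group(1).upper() if match else None


def score(submitted: str, correct: str) -> float:
    """Binary reward: 1.0 for a case-insensitive exact match on the letter, else 0.0."""
    letter = normalize_answer(submitted)
    if letter is not None and letter == correct.strip().upper():
        return 1.0
    return 0.0
```

In this sketch, input that does not reduce to a single letter (for example "AB") is treated as incorrect rather than raising an error, which keeps the reward strictly binary.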

Data

Data is stored as four Parquet files (one per subject: chemistry.parquet, coding.parquet, math.parquet, physics.parquet). Each file contains questions with embedded images (base64-encoded PNG), multiple-choice options, and correct answer letters. The environment uses lazy loading for memory efficiency.
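
As a rough illustration of working with this layout, the sketch below decodes a base64-encoded PNG and yields tasks one at a time instead of decoding every image up front. The field name `image_b64` and both helper names are hypothetical; the actual column names in the Parquet files are not documented here. Rows are passed in as plain dictionaries so the example stays independent of any particular Parquet reader.

```python
import base64
from typing import Dict, Iterable, Iterator

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png(b64_text: str) -> bytes:
    """Decode a base64 string and check that the payload looks like a PNG."""
    raw = base64.b64decode(b64_text, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded payload is not a PNG")
    return raw


def iter_tasks(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Lazily yield rows with the image decoded, one row at a time.

    'image_b64' is an assumed field name, used for illustration only.
    """
    for row in rows:
        yield {**row, "image_bytes": decode_png(row["image_b64"])}
```

Because `iter_tasks` is a generator, an image is decoded only when its task is requested, which is one simple way to get the memory behavior that lazy loading aims for.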

Source: luckychao/EMMA

Tools

Tool	Description
`submit_answer`	Submit your answer letter (A, B, C, D, or E) for the multiple-choice question.

Time Horizon

EMMA is a single-turn environment. The agent receives a multimodal question and submits one answer letter for a total of one tool call.

Environment Difficulty

The original paper (ICML 2025 Oral) evaluates state-of-the-art MLLMs on EMMA:

Model	EMMA-mini	Full EMMA
o1	45.75%	-
Gemini 2.0 Flash Thinking	-	38.06%
Claude 3.5 Sonnet	-	37.23%
Human Expert	77.75%	-

On EMMA-mini, human experts (77.75%) outperform the best listed model, o1 (45.75%), by 32 percentage points. Chain-of-thought prompting has divergent effects: it improves accuracy for closed-source models but reduces it for open-source models.

Other Environment Requirements

There are no further environment requirements; EMMA works out of the box with the OpenReward endpoint without any secrets.

Safety

This environment evaluates multimodal reasoning on academic problems and does not present direct safety risks.

Citations

@article{hao2025emma,
  author    = {Yunzhuo Hao and Jiawei Gu and Huichen Will Wang and Linjie Li and Zhengyuan Yang and Lijuan Wang and Yu Cheng},
  title     = {Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark},
  journal   = {arXiv preprint arXiv:2501.05444},
  year      = {2025},
  url       = {https://arxiv.org/abs/2501.05444}
}