EMMA

EMMA (Enhanced MultiModal reAsoning) is a benchmark for assessing organic multimodal reasoning in MLLMs across mathematics, physics, chemistry, and coding.

Type: RL Env
Runtime: ORS
License: unknown
Size: 2,788 tasks
Published: Feb 2026

Description

EMMA (Enhanced MultiModal reAsoning) is an environment for evaluating expert-level multimodal reasoning across mathematics, physics, chemistry, and coding. Tasks require integrated visual and textual reasoning that cannot be solved by reasoning independently in each modality. Questions are multiple-choice with embedded images.

Capabilities

  • Multimodal reasoning across text and images
  • Expert-level problem solving in chemistry, coding, math, and physics
  • Interpreting diagrams, equations, charts, and scientific figures

Compute Requirements

This is a single-turn environment with no sandbox.

Tasks

There is one split in this environment:

  • Test: 2,788 tasks across four subjects:
    • Chemistry: 1,176
    • Coding: 564
    • Math: 892
    • Physics: 156
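The subject breakdown can be checked with a few lines of arithmetic. This is only a sketch: the dictionary keys are illustrative labels, not identifiers taken from the environment.

```python
# Per-subject task counts as listed above (keys are illustrative labels).
SUBJECT_COUNTS = {"chemistry": 1176, "coding": 564, "math": 892, "physics": 156}

# The four subjects should add up to the size of the test split.
total = sum(SUBJECT_COUNTS.values())

# Share of the split contributed by each subject, in percent.
shares = {subject: round(100 * n / total, 1) for subject, n in SUBJECT_COUNTS.items()}

print(total)   # 2788
print(shares)  # {'chemistry': 42.2, 'coding': 20.2, 'math': 32.0, 'physics': 5.6}
```

Because chemistry accounts for roughly 42% of the tasks and physics for under 6%, an overall accuracy figure is weighted heavily toward chemistry; reporting per-subject accuracy alongside it avoids hiding weak subjects.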

Each task presents a multiple-choice question (A–E) with one or more embedded images. The agent must select the correct answer letter.

Reward Structure

This is a single-turn environment with binary reward:

  • 1.0 — Correct answer letter selected
  • 0.0 — Incorrect answer

Evaluation is deterministic exact match on the answer letter (case-insensitive). No LLM grading is used. The environment accepts flexible input formats (e.g., "A", "A)", "A.").
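The grading rule described above could be sketched as follows. This is not the environment's actual implementation: the function names `normalize_answer` and `grade` are hypothetical, and the exact set of accepted formats is an assumption based on the examples given ("A", "A)", "A.").

```python
import re
from typing import Optional

# Assumed pattern: optional "(", one letter A-E in either case, optional ")", "." or ":".
_ANSWER_PATTERN = re.compile(r"\s*\(?([A-Ea-e])\s*[).:]?\s*")


def normalize_answer(raw: str) -> Optional[str]:
    """Return the upper-case answer letter, or None if the input is not a single letter."""
    match = _ANSWER_PATTERN.fullmatch(raw)
    return match.group(1).upper() if match else None


def grade(submitted: str, correct: str) -> float:
    """Binary reward: 1.0 on a case-insensitive exact letter match, otherwise 0.0."""
    letter = normalize_answer(submitted)
    if letter is not None and letter == correct.strip().upper():
        return 1.0
    return 0.0
```

Under these assumptions, `grade("a)", "A")` yields 1.0, while `grade("B", "A")` and an ambiguous input such as `grade("AB", "A")` both yield 0.0.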

Data

Data is stored as four Parquet files (one per subject: chemistry.parquet, coding.parquet, math.parquet, physics.parquet). Each file contains questions with embedded images (base64-encoded PNG), multiple-choice options, and correct answer letters. The environment uses lazy loading for memory efficiency.
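Since the images are stored as base64-encoded PNG, a consumer has to decode them before display or further processing. The helper below is a minimal sketch using only the standard library; the name `decode_png` is hypothetical, and because the column names of the Parquet files are not given here, it takes the encoded string directly rather than a row.

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png(b64_image: str) -> bytes:
    """Decode a base64 string and check that the payload looks like a PNG image."""
    raw = base64.b64decode(b64_image, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded payload is not a PNG image")
    return raw
```

The returned bytes can be written to a `.png` file or handed to an image library. Reading the Parquet files themselves would typically go through a library such as pandas or pyarrow, one subject file at a time, which fits the lazy-loading approach mentioned above.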

Source: luckychao/EMMA

Tools

| Tool | Description |
| --- | --- |
| submit_answer | Submit your answer letter (A, B, C, D, or E) for the multiple-choice question. |

Time Horizon

EMMA is a single-turn environment. The agent receives a multimodal question and submits one answer letter for a total of one tool call.
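The one-call episode shape can be illustrated with a toy stand-in. Everything here is hypothetical (the `Task` and `SingleTurnEpisode` classes and their fields are not the OpenReward API); only the tool name `submit_answer` and the single-call, binary-reward behaviour come from this page. Flexible input parsing such as "A)" is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Task:
    """Hypothetical task record: question text, options by letter, correct letter."""
    question: str
    options: Dict[str, str]
    answer: str


class SingleTurnEpisode:
    """Toy stand-in for the environment: the first submit_answer call ends the episode."""

    def __init__(self, task: Task) -> None:
        self.task = task
        self.done = False

    def submit_answer(self, letter: str) -> float:
        if self.done:
            raise RuntimeError("episode already finished")
        self.done = True
        # Binary reward on a case-insensitive letter match.
        return 1.0 if letter.strip().upper() == self.task.answer.upper() else 0.0


task = Task(question="Which option is correct?", options={"A": "x", "B": "y"}, answer="B")
episode = SingleTurnEpisode(task)
reward = episode.submit_answer("b")
print(reward)  # 1.0
```

A second call on the same episode raises an error in this sketch, mirroring the fact that the agent gets exactly one tool call per task.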

Environment Difficulty

The original paper (ICML 2025 Oral) evaluates state-of-the-art MLLMs on EMMA:

| Model | EMMA-mini | Full EMMA |
| --- | --- | --- |
| o1 | 45.75% | - |
| Gemini 2.0 Flash Thinking | - | 38.06% |
| Claude 3.5 Sonnet | - | 37.23% |
| Human Expert | 77.75% | - |

Human experts score 77.75% on EMMA-mini, 32 percentage points above o1 (45.75%), the best model listed on that subset. Chain-of-thought prompting has divergent effects: it improves closed-source models while reducing the accuracy of open-source models.

Other Environment Requirements

There are no further environment requirements; EMMA works out of the box with the OpenReward endpoint without any secrets.

Safety

This environment evaluates multimodal reasoning on academic problems and does not present direct safety risks.

Citations

@article{hao2025emma,
  author    = {Yunzhuo Hao and Jiawei Gu and Huichen Will Wang and Linjie Li and Zhengyuan Yang and Lijuan Wang and Yu Cheng},
  title     = {Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark},
  journal   = {arXiv preprint arXiv:2501.05444},
  year      = {2025},
  url       = {https://arxiv.org/abs/2501.05444}
}