BeyondAIME

BeyondAIME is a benchmark for evaluating generalized STEM reasoning that extends AIME-style problems to probe deep, stepwise mathematical problem-solving.

Type: RL Env
Runtime: ORS
License: CC0 1.0
Size: 100 tasks
Published: Feb 2026

Description

BeyondAIME is an environment for evaluating advanced mathematical reasoning on 100 competition-level problems with difficulty at or above AIME problems #11-15. All problems have been manually revised to be unique and contamination-resistant, focusing on reasoning rather than domain-specific knowledge. Answers are integers, enabling unambiguous automated evaluation.

Capabilities

  • Advanced competition-level mathematical reasoning
  • Integer answer validation with exact match
  • Problems spanning algebra, number theory, combinatorics, and geometry

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

CC0 1.0 Universal (Public Domain).

Tasks

There is one split in this environment:

  • test: 100 tasks

Each problem requires an integer answer (range: 3 to 33,124,147).

Reward Structure

Single-turn evaluation with deterministic grading. The agent submits an integer answer via the submit_answer tool. The submitted answer is compared via exact integer match against the ground truth. Reward is 1.0 if correct, 0.0 if incorrect.
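
The grading rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the platform's implementation: the function name `grade` is hypothetical, and treating input that does not parse as an integer as incorrect is an assumption.

```python
def grade(submitted, ground_truth: int) -> float:
    """Return 1.0 on an exact integer match, otherwise 0.0 (hypothetical helper)."""
    try:
        # Accept ints or integer-looking strings such as " 42 ".
        value = int(str(submitted).strip())
    except ValueError:
        # Assumption: anything that does not parse as an integer scores 0.0.
        return 0.0
    return 1.0 if value == ground_truth else 0.0
```

Because the comparison is on integers rather than strings, surrounding whitespace in a submission does not affect the result, and there is no partial credit.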

Data

The data file beyondaime_test.parquet (100 problems) is sourced from the Hugging Face dataset ByteDance-Seed/BeyondAIME and stored on the OpenReward platform.

Tools

Tool            Description
submit_answer   Submit an integer answer. Deterministic evaluation via exact match. Ends the episode.

Time Horizon

Single-turn. The agent reads the problem and submits one integer answer.
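
A single-turn episode of this kind might be modeled as follows. The class `SingleTurnEpisode` and its fields are hypothetical names chosen for illustration; only the `submit_answer` tool name and its episode-ending behavior come from the description above, and the toy problem is not from the dataset.

```python
from dataclasses import dataclass


@dataclass
class SingleTurnEpisode:
    """Hypothetical model of one task: one problem, one submission."""
    problem: str
    ground_truth: int
    done: bool = False
    reward: float = 0.0

    def submit_answer(self, answer: int) -> float:
        # The first submission ends the episode, so a second call is an error.
        if self.done:
            raise RuntimeError("episode already ended")
        self.reward = 1.0 if answer == self.ground_truth else 0.0
        self.done = True
        return self.reward


# Toy usage with a made-up problem.
episode = SingleTurnEpisode(problem="What is 1 + 2?", ground_truth=3)
first_reward = episode.submit_answer(3)
```

Raising on a second submission is one possible design choice; an implementation could equally ignore later calls.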

Environment Difficulty

BeyondAIME is designed to be significantly harder than a typical AIME problem: its difficulty is at or above that of AIME problems #11-15, the last five questions on each 15-question exam. ByteDance-Seed reports the following evaluation results:

Model                 Accuracy
OpenAI o3-mini        63.6%
Gemini 2.5 Pro        58.8%
Seed-Thinking-v1.5    48.0%
DeepSeek R1           42.4%
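
With binary rewards, accuracy is the fraction of tasks that earn 1.0. The sketch below computes it for one run and averages it over several runs. The function names are hypothetical, and the source does not say how many runs were used. Figures such as 63.6% on 100 tasks are not whole percentages, which would be consistent with averaging over repeated runs, but that is an inference rather than a documented fact.

```python
from statistics import mean


def run_accuracy(rewards):
    """Percentage of tasks solved in one run, given per-task rewards of 0.0 or 1.0."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    return 100.0 * sum(rewards) / len(rewards)


def mean_accuracy(runs):
    """Average the per-run accuracy over several independent runs."""
    return mean(run_accuracy(rewards) for rewards in runs)
```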

Other Environment Requirements

There are no further environment requirements; BeyondAIME works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in BeyondAIME solve advanced mathematics problems in a standard environment. The environment does not present direct safety risks.

Citation

@misc{beyondaime2025,
  title={BeyondAIME: Advancing Math Reasoning Evaluation Beyond High School Olympiads},
  author={ByteDance-Seed},
  year={2025},
  url={https://huggingface.co/datasets/ByteDance-Seed/BeyondAIME}
}