AIME 2026 RL Env (Prime Intellect)

AIME-26 evaluation environment

Type: RL Env
Capabilities: Math
Runtime: single-turn
License: unknown
Size: v0.1.3
Published: Mar 2026

AIME-26

Overview

  • Environment ID: aime2026
  • Short description: AIME 2026 problems (AIME I and II), evaluated in a single turn with chain-of-thought (CoT) prompting and boxed numeric answers.
  • Tags: math, aime, 2026, single-turn, boxed-answer

Datasets

  • Primary dataset(s): MathArena/aime_2026, loaded directly via load_dataset and pinned to revision 10b4e45b7a503075d4da8a0d57916a4f06ce6bd2
  • Split sizes: defaults to the train split (N=30)
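
The pinned load described above can be sketched as follows. This is a minimal illustration rather than the environment's actual source: the helper name load_aime2026 is hypothetical, and only the dataset ID, split, and revision hash come from this README. The `revision` argument is a standard parameter of `datasets.load_dataset`.

```python
# Values taken from the Datasets section of this README.
DATASET_ID = "MathArena/aime_2026"
REVISION = "10b4e45b7a503075d4da8a0d57916a4f06ce6bd2"
SPLIT = "train"


def load_aime2026():
    """Load the pinned snapshot (hypothetical helper, not part of the environment).

    Requires the `datasets` package and network access to the Hugging Face Hub.
    """
    # Imported lazily so the constants above stay usable without `datasets` installed.
    from datasets import load_dataset

    # Pinning `revision` to a commit hash keeps the data fixed even if the
    # upstream repository is later updated.
    return load_dataset(DATASET_ID, split=SPLIT, revision=REVISION)
```

Calling load_aime2026() and checking len() on the result should give 30 if the split size listed above still holds for that revision.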

Task

  • Type: single-turn
  • Parser: MaybeThinkParser wrapping extract_boxed_answer
  • Rubric overview: Exact-match on parsed boxed answer (single criterion, weight 1.0).
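
To make the parse-then-score flow concrete, here is a standalone sketch of the idea. It does not call the verifiers library and is not its implementation: strip_thinking, extract_last_boxed, and reward are hypothetical names. The sketch assumes that any text before a closing `</think>` tag is discarded and that the last `\boxed{...}` in the remainder is the answer.

```python
def strip_thinking(text: str) -> str:
    """Keep only the text after the last </think> tag, if one is present (assumed behavior)."""
    _, sep, tail = text.rpartition("</think>")
    return tail if sep else text


def extract_last_boxed(text: str) -> str:
    """Return the contents of the last \\boxed{...}, or "" when none is found."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return ""
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace reached: everything collected is the answer.
                return "".join(chars).strip()
        chars.append(ch)
    return ""  # unbalanced braces are treated as "no answer"


def reward(completion: str, target: str) -> float:
    """Exact-match reward: 1.0 only when a boxed answer exists and equals the target."""
    parsed = extract_last_boxed(strip_thinking(completion))
    return 1.0 if parsed != "" and parsed == target.strip() else 0.0
```

Under these assumptions, a response that mentions the correct number without boxing it scores 0.0, because the parsed answer is the empty string.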

Quickstart

Run an evaluation with default settings:

uv run vf-eval aime2026

Configure model and sampling:

uv run vf-eval aime2026 \
  -m gpt-4.1-mini \
  -n 20 -r 3 -t 1024 -T 0.7

Notes:

  • Use -a / --env-args to pass environment-specific configuration as a JSON object.

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| system_prompt | str or None | None | System prompt shown to the model |
| instruction_prompt_pre | str | "Solve the following math problem..." | Prefix prepended to each question |
| instruction_prompt_post | str | "" | Suffix appended to each question |
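
As an illustration of passing these arguments through -a / --env-args, the sketch below builds the JSON object in Python and prints a shell-safe command line. The argument names come from the table above; the prompt text is a made-up example value, and the script only prints the command rather than running it.

```python
import json
import shlex

# Overrides for the environment arguments listed above.
# The suffix text is a hypothetical example, not the environment's default.
env_args = {
    "system_prompt": None,
    "instruction_prompt_post": "Give the final answer as an integer inside \\boxed{}.",
}

# json.dumps produces the JSON object expected by -a / --env-args
# (Python None becomes JSON null).
cmd = ["uv", "run", "vf-eval", "aime2026", "-a", json.dumps(env_args)]

# shlex.join quotes the JSON so it can be pasted into a POSIX shell as one argument.
print(shlex.join(cmd))
```

Generating the JSON programmatically avoids hand-escaping quotes and backslashes, which is easy to get wrong when a prompt contains LaTeX.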

Metrics

| Metric | Meaning |
| --- | --- |
| reward | 1.0 if parsed boxed answer equals target, else 0.0 |

Changelog

v0.1.3

  • Fix scoring bug: explicitly cast answer column to string type in dataset mapping to prevent HuggingFace Datasets from coercing str(int(...)) back to int64, which caused 'int' object is not subscriptable errors in the rubric

v0.1.2

  • Pin HuggingFace dataset loading to a fixed revision and set trust_remote_code=False

v0.1.1

  • Bump verifiers to v0.1.12.dev1, which brings performance improvements to MathRubric. The rubric now uses extract_boxed_answer in strict mode: if no \boxed{} answer is found, the parsed answer is "" and always scores 0. This prevents false positives where the model was rewarded for containing the correct answer anywhere in the response

v0.1.0

  • Initial release