AIME-26

Environment ID: aime2026
Short description: AIME 2026 problems (AIME I/II) evaluated single-turn with CoT and boxed numeric answers.
Tags: math, aime, 2026, single-turn, boxed-answer

Primary dataset(s): MathArena/aime_2026 loaded directly via load_dataset and pinned to revision 10b4e45b7a503075d4da8a0d57916a4f06ce6bd2
Split sizes: Defaults to split train (N=30)
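
For reference, pinned loading can be sketched roughly as follows. This is a minimal illustration, not the environment's actual loader: the function name load_aime2026 and the two constants are hypothetical, and only the dataset ID, the split name, and the revision hash come from this README. It assumes the Hugging Face datasets package is installed and the Hub is reachable when the function is called.

```python
# Dataset ID and revision hash as stated in this README.
DATASET_ID = "MathArena/aime_2026"
REVISION = "10b4e45b7a503075d4da8a0d57916a4f06ce6bd2"


def load_aime2026(split: str = "train"):
    """Hypothetical helper: load the benchmark at the pinned revision.

    Passing `revision` makes load_dataset fetch that exact commit of the
    dataset repository, so later upstream edits do not change the eval set.
    """
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, revision=REVISION)
```

Calling load_aime2026() should return the train split; per the split sizes above, that is expected to contain 30 problems.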

Type: single-turn
Answer extraction: extract_boxed_answer on the final assistant message
Reward overview: the math_verify reward compares the boxed answer to the target (single criterion, weight 1.0).
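
The extraction and scoring flow can be illustrated with a small self-contained sketch. This is not the verifiers implementation: extract_boxed_strict and exact_match_reward are hypothetical names, and the comparison is a plain integer match standing in for math_verify, on the assumption that AIME targets are integers. It shows only the strict behaviour, where a response without a \boxed{} answer parses to an empty string and scores 0.

```python
def extract_boxed_strict(text: str) -> str:
    """Return the contents of the last \\boxed{...} in text, or "" if none."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return ""  # strict mode: no boxed answer means nothing is parsed
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return ""  # unbalanced braces: treat as no answer


def exact_match_reward(completion: str, target: str) -> float:
    """Stand-in for math_verify: 1.0 on a match, else 0.0."""
    parsed = extract_boxed_strict(completion)
    if not parsed:
        return 0.0
    try:
        # Integer comparison, so "042" and "42" are treated as equal.
        return 1.0 if int(parsed) == int(target) else 0.0
    except ValueError:
        # Non-integer content falls back to an exact string comparison.
        return 1.0 if parsed == target.strip() else 0.0
```

Under this sketch, a completion that mentions the correct number only in running text receives 0.0, while the same number inside \boxed{} receives 1.0.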

Run an evaluation with default settings:

uv run vf-eval aime2026

Configure model and sampling:

uv run vf-eval aime2026 \
  -m gpt-4.1-mini \
  -n 20 -r 3 -t 1024 -T 0.7

Notes:

Use -a / --env-args to pass environment-specific configuration as a JSON object.

Environment arguments: none. The dataset, prompt, and grading settings are fixed for this benchmark.

| Metric | Meaning |
| --- | --- |
| `reward` | Weighted sum of rewards (here, `math_verify`) |
| `math_verify` | 1.0 if the boxed answer matches the target under `math_verify`, else 0.0 |

Fix scoring bug: explicitly cast the answer column to string type in the dataset mapping. HuggingFace Datasets was coercing str(int(...)) back to int64, which caused 'int' object is not subscriptable errors in the rubric.

Pin HuggingFace dataset loading to a fixed revision and set trust_remote_code=False

Bump verifiers to v0.1.12.dev1, which brings performance improvements to MathRubric and switches it to extract_boxed_answer in strict mode. If no \boxed{} answer is found, the parsed answer is "" and always scores 0. This prevents false positives where the model was rewarded for containing the correct answer anywhere in the response.