AIME-26
Overview
- Environment ID:
aime2026 - Short description: AIME 2026 problems (AIME I/II) evaluated single-turn with CoT and boxed numeric answers.
- Tags: math, aime, 2026, single-turn, boxed-answer
Datasets
- Primary dataset(s): MathArena/aime_2026 loaded directly via
load_datasetand pinned to revision10b4e45b7a503075d4da8a0d57916a4f06ce6bd2 - Split sizes: Defaults to split
train(N=30)
Task
- Type: single-turn
- Parser:
MaybeThinkParserwrappingextract_boxed_answer - Rubric overview: Exact-match on parsed boxed answer (single criterion, weight 1.0).
Quickstart
Run an evaluation with default settings:
uv run vf-eval aime2026
Configure model and sampling:
uv run vf-eval aime2026 \
-m gpt-4.1-mini \
-n 20 -r 3 -t 1024 -T 0.7
Notes:
- Use
-a/--env-argsto pass environment-specific configuration as a JSON object.
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
system_prompt | str or None | None | System prompt shown to the model |
instruction_prompt_pre | str | "Solve the following math problem..." | Prefix prepended to each question |
instruction_prompt_post | str | "" | Suffix appended to each question |
Metrics
| Metric | Meaning |
|---|---|
reward | 1.0 if parsed boxed answer equals target, else 0.0 |
Changelog
v0.1.3
- Fix scoring bug: explicitly cast
answercolumn to string type in dataset mapping to prevent HuggingFace Datasets from coercingstr(int(...))back toint64, which caused'int' object is not subscriptableerrors in the rubric
v0.1.2
- Pin HuggingFace dataset loading to a fixed revision and set
trust_remote_code=False
v0.1.1
- Bump verifiers to v0.1.12.dev1: perf improvements to
MathRubric; now usesextract_boxed_answerin strict mode — if no\boxed{}answer is found the parsed answer is""which always scores 0, preventing false positives where the model is rewarded for containing the correct answer anywhere in the response
v0.1.0
- Initial release