math

Category: reasoning
Slug: math
Evals: 12
Tools: 58
Models: 539
Papers: 8

Evals testing this capability

AIME 2024: Problems from the American Invitational Mathematics Examination

Mathematical Association of America

Official 15-problem exam on the high-school olympiad qualification track, used by labs as a fresh, contamination-resistant math-reasoning benchmark.

Status: Active; Tags: Math, Planning

BIG-Bench Hard (BBH)

Google Research

23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.

Status: Saturated; Tags: Planning, Scientific Reasoning, Math

BIG-Bench

Google DeepMind

204 diverse tasks contributed by 450 researchers at 132 institutions - the original "test everything" LLM benchmark.

Status: Saturated; Tags: Factual Recall, Planning, Math

FrontierMath

Epoch AI

Unpublished collection of research-level mathematics problems written by professional mathematicians, designed to be the hardest open math benchmark.

Status: Active; Tags: Math, Scientific Reasoning

GSM8K

OpenAI

8.5k grade-school math word problems requiring multi-step arithmetic reasoning to reach a single numeric answer.

Status: Saturated; Tags: Math, Planning

Humanity's Last Exam (HLE)

Center for AI Safety (CAIS)

2,500 expert-authored questions across math, sciences, and humanities designed to be the hardest closed-ended benchmark for frontier models.

Status: Active; Tags: Scientific Reasoning, Math, Factual Recall

LiveBench

Abacus.AI

Rolling contamination-free benchmark that updates questions monthly across math, coding, reasoning, language, instruction-following, and data analysis.

Status: Active; Tags: Math, Code Generation, Instruction Following

MATH-500

OpenAI

500-problem subset of the Hendrycks MATH competition-math benchmark, popularized by OpenAI's PRM800K work as a standard evaluation slice.

Status: Saturated; Tags: Math, Planning

MATH

University of California, Berkeley

12,500 high-school competition math problems with full LaTeX-formatted step-by-step solutions, spanning algebra through number theory.

Status: Saturated; Tags: Math

MathVista

University of California, Los Angeles (UCLA)

6,141 multimodal math problems combining diagrams, charts, and figures - the canonical "math + vision" benchmark.

Status: Active; Tags: Math, Image Understanding, Multimodal

MGSM (Multilingual GSM8K)

Google DeepMind

Hand-translated 250-problem subset of GSM8K in 10 languages - a multilingual grade-school math benchmark.

Status: Active; Tags: Math, Multilingual

PRM800K

OpenAI

800,000 step-level human labels on GPT-4 solutions to MATH problems - the canonical process-reward training/eval dataset.

Status: Active; Tags: Math, LLM Judging

Tools lifting evals here

NuminaMath

Numina

An 860k-problem competition-math dataset with detailed solutions, the open community's go-to corpus for training math-specialized LLMs.

Type: SFT Dataset; Tags: Math, Scientific Reasoning

lifts 4 evals here

VF Openbench RL Env (Community)

Environment for single-turn tasks in OpenBench

Type: RL Env

lifts 4 evals here

Verifiers Math (math-python)

Prime Intellect

Multi-turn math problem-solving environment where the model proposes Python code in a sandbox to compute and verify numerical answers.

Type: RL Env; Tags: Math, Tool Calling, Code Generation

lifts 3 evals here

Bigbench BBH RL Env (Prime Community)

Prime Community

BIG-Bench and BBH implementation

Type: RL Env; Tags: BIG-Bench, BBH, NLP

lifts 2 evals here

MATH Group RL Env (Prime Intellect)

Prime Intellect

Math group environment

Type: RL Env; Tags: Math, GSM8K

lifts 2 evals here

OpenThoughts

Open Thoughts

A fully-open distillation of long DeepSeek-R1 reasoning traces - the community's flagship "open R1" SFT corpus for reasoning models.

Type: SFT Dataset; Tags: Math, Code Generation, Scientific Reasoning

lifts 2 evals here

s1K

Stanford Center for Research on Foundation Models (CRFM)

Stanford's hand-curated 1,000-problem reasoning dataset that, paired with budget forcing at inference, produced o1-competitive results for ~$50 of compute.

Type: SFT Dataset; Tags: Math, Scientific Reasoning

lifts 2 evals here

AIME 2024 RL Env (Prime Intellect)

Prime Intellect

AIME-24 evaluation environment

All tools lifting evals here

NuminaMath

VF Openbench RL Env (Community)

Verifiers Math (math-python)

Bigbench BBH RL Env (Prime Community)

MATH Group RL Env (Prime Intellect)

OpenThoughts

s1K

AIME 2024 RL Env (Prime Intellect)

Aya Dataset

BBH RL Env (Community)

Certainty Collapse RL Env (Community)

Compositional Hacks RL Env (Community)

COT Theater RL Env (Community)

Deepconf RL Env (Community)

Deepscaler MATH RL Env (Prime Intellect)

Deepscaler RL Env (Prime Intellect)

Defend Concede RL Env (Community)

Discover Gsm8k RL Env (Community)

Doublecheck RL Env (Community)

Doublecheck RL Env (Prime Intellect)

Emergence Prediction RL Env (Community)

Emoji HACK RL Env (Community)

FH Aviary RL Env (Prime Community)

FH Aviary RL Env (Prime Intellect)

Formatting Emergence RL Env (Community)

Goodsirmath8k RL Env (Kunumi)

Gsm8k Multireward RL Env (Community)

Gsm8k Olmes RL Env (Community)

Gsm8k RL Env (Community)

Gsm8k RL Env (Community)

Gsm8k RL Env (Dev Team)

Gsm8k RL Env (Prime Intellect)

Gsm8k RL Env (Sarvam AI Team)

Hendrycks MATH RL Env (Community)

Hendrycks MATH RL Env (Prime Intellect)

Hendrycksmath RL Env (Community)

Hermes Example RL Env (Community)

HLE RL Env (Prime Intellect)

LAST Number RL Env (Community)

Length HACK RL Env (Community)

LLM Trainer RL Env (Community)

MATH 500 RL Env (Community)

MATH 500 RL Env (Prime Intellect)

MATH Python RL Env (Community)

MATH Python RL Env (Prime Intellect)

MATH Reasoning RL Env (Community)

Multimodal RL Env (Community)

OpenOrca

P2p Gsm8k RL Env (Sarvam AI Team)

Reasoning HACK RL Env (Community)

Regex QC RL Env (Community)

Sensible Thinker RL Env (Community)

SlimOrca

Tülu 3 SFT Mixture

WEB PY RL Env (Community)

WEB PY RL Env (Prime Community)

WEB PY RL Env (Prime Intellect)

WizardLM Evol-Instruct
