MATH

Saturated

12,500 high-school competition math problems with full LaTeX-formatted step-by-step solutions, spanning algebra through number theory.

Open

Publisher: University of California, Berkeley
Capabilities: Math
Domain: math
Format: HF Dataset
Size: 12,500 tasks
License: MIT
Published: Mar 2021
Notable for: Competition-level math problems, each with a full step-by-step solution.
Canonical: github.com/hendrycks/math
Also on: huggingface.co/datasets/hendrycks/competition_math

Attribution

Leaderboard scores: prime-hub

Top score 100.0% by GPT-4.1 Mini - 2 models reporting (2 frontier)

Top models

Bar chart of MATH scores, 2 bars. Highest value: GPT-4.1 Mini at 100.

2 models

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Hendrycks MATH RL Env (Community)

MATH (Hendrycks) evaluation matching OLMES minerva_math::tulu methodology

Implementation · RL Env · Math

Hendrycks MATH RL Env (Prime Intellect)

Prime Intellect

Single-turn Hendrycks MATH-style problems with boxed numeric answers and CoT.

Implementation · RL Env · Math
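
The "boxed numeric answers" format mentioned above is usually graded by pulling the final \boxed{...} out of the model's completion and comparing it to the reference answer. The following is a minimal sketch of that idea, not the environment's actual grader: the function names `extract_boxed` and `is_correct` are hypothetical, and the comparison is a plain string match, whereas real graders typically normalize LaTeX much more aggressively.

```python
from typing import Optional

BOX = "\\boxed{"


def extract_boxed(text: str) -> Optional[str]:
    # Find the last occurrence of \boxed{ so that the final answer wins
    # over any boxed intermediate values in the chain of thought.
    start = text.rfind(BOX)
    if start == -1:
        return None
    i = start + len(BOX)
    depth = 1  # we are already inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: return everything inside.
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # braces never balanced


def is_correct(completion: str, reference: str) -> bool:
    # Assumption: exact match after stripping surrounding whitespace.
    answer = extract_boxed(completion)
    return answer is not None and answer.strip() == reference.strip()
```

Brace counting is used instead of a regular expression so that nested LaTeX such as \boxed{\frac{1}{2}} is returned whole.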

Verifiers Math (math-python)

Prime Intellect

Multi-turn math problem-solving environment where the model proposes Python code in a sandbox to compute and verify numerical answers.

Implementation · RL Env · Math · Tool Calling · Code Generation
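
In a tool-using setup like the one described above, each turn takes the Python code the model proposes, executes it, and returns the output to the model. The sketch below illustrates only that execution step under loud assumptions: `run_python` is a hypothetical helper, and a child interpreter with a timeout stands in for the sandbox. It provides no real isolation and is not how the listed environment is implemented.

```python
import subprocess
import sys


def run_python(code: str, timeout_s: float = 5.0) -> str:
    # Run the proposed code in a separate interpreter process.
    # Assumption: this is a stand-in for a sandbox, not a secure one.
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "error: timed out"
    if proc.returncode != 0:
        # Report only the last stderr line (usually the exception message).
        lines = proc.stderr.strip().splitlines()
        return "error: " + (lines[-1] if lines else "nonzero exit")
    return proc.stdout.strip()
```

For example, `run_python("print(sum(range(1, 101)))")` returns the string `"5050"`, which the environment could pass back to the model as the tool result for the next turn.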

VF Openbench RL Env (Community)

Environment for single-turn tasks in OpenBench

Trains toward · RL Env

Hendrycksmath RL Env (Community)

hendrycksmath evaluation environment

Trains toward · RL Env · Math

Doublecheck RL Env (Prime Intellect)

Prime Intellect

Test environment for double-checking math answers

Trains toward · RL Env · Doublecheck · Math · Hendrycks

MATH Group RL Env (Prime Intellect)

Prime Intellect

Math group environment

Trains toward · RL Env · Math · Gsm8k

MATH Python RL Env (Prime Intellect)

Prime Intellect

Solve math problems using Python in a sandbox environment

Trains toward · RL Env · Tool Use · Math · Prime Sandboxes

Doublecheck RL Env (Community)

Test environment for double-checking math answers

Trains toward · RL Env · Doublecheck · Math · Hendrycks

MATH Python RL Env (Community)

Solve math problems using Python in a sandbox environment

Trains toward · RL Env · Tool Use · Math · Prime Sandboxes

Papers

Measuring Mathematical Problem Solving With the MATH Dataset

NeurIPS · 2021

Introduces the MATH benchmark of 12,500 competition-level math problems with step-by-step solutions, spanning algebra to number theory at high-school olympiad difficulty.

introduces

Contributors

Dan Hendrycks

FAQ

What is MATH?: 12,500 high-school competition math problems with full LaTeX-formatted step-by-step solutions, spanning algebra through number theory.
What capabilities does MATH test?: MATH evaluates mathematical problem solving on competition-level problems.
What is the current top score on MATH?: The top reported score is 100.0% by GPT-4.1 Mini, across 2 models reporting (2 from frontier labs).
How can a model improve its MATH score?: Sophon links RL environments, datasets, and scaffolds that target this eval, including Hendrycks MATH RL Env (Community), Hendrycks MATH RL Env (Prime Intellect), Verifiers Math (math-python), and VF Openbench RL Env (Community).
What license is MATH under?: MATH is available under MIT.