What capabilities does MATH-500 test?

MATH-500 evaluates math and planning.

What is the current top score on MATH-500?

The top reported score is 99.2% by Grok 3 mini, across 178 models reporting (46 from frontier labs).

How can a model improve its MATH-500 score?

Tools linked to MATH-500 on Sophon include MATH 500 RL Env (Community), MATH 500 RL Env (Prime Intellect), VF Openbench RL Env (Community), and NuminaMath. These are RL environments, datasets, and scaffolds that target this eval.

What license is MATH-500 under?

MATH-500 is available under the MIT license.

MATH-500

Saturated

500-problem subset of the Hendrycks MATH competition-math benchmark, popularized by OpenAI's PRM800K work as a standard evaluation slice.

Open

Publisher: OpenAI
Capabilities: Math, Planning
Domain: math
Format: HF Dataset
Size: 500 tasks
License: MIT
Published: Mar 2021
Notable for: Benchmark for evaluating math and planning in the math domain.
Canonical: github.com/openai/prm800k
Also on: huggingface.co/datasets/HuggingFaceH4/MATH-500

Attribution

Leaderboard scores: AA, prime-hub

Top score 99.2% by Grok 3 mini - 178 models reporting (46 frontier)
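
With only 500 tasks, each problem moves the score by a fixed step, which helps when reading differences between near-saturated models. The sketch below works through that arithmetic. It assumes a reported percentage is plain accuracy over all 500 problems from a single run, which the listing does not state; the function names are hypothetical.

```python
NUM_PROBLEMS = 500  # benchmark size, taken from the listing above


def correct_count(score_percent: float, n: int = NUM_PROBLEMS) -> int:
    """Number of problems solved, assuming the score is single-run accuracy."""
    return round(score_percent / 100 * n)


def step_percent(n: int = NUM_PROBLEMS) -> float:
    """Percentage points gained or lost per problem."""
    return 100 / n


solved = correct_count(99.2)
missed = NUM_PROBLEMS - solved
print(f"{solved} solved, {missed} missed, {step_percent()} points per problem")
```

Under that assumption, a 99.2% score corresponds to 496 problems solved and 4 missed, and two models that differ by 0.2 points differ by a single problem. Scores averaged over several samples per problem would not map to whole problem counts this way.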

Score history: 178 models
Top models: 178 models

[Bar chart: MATH-500 scores for 21 models. Highest value: o3 at 99.2.]

Where it's ranked

LiveBench (Abacus.AI): aggregated with 6 others · monthly
Open LLM Leaderboard (Hugging Face): aggregated with 6 others · live

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

MATH 500 RL Env (Community)

MATH-500 competition math environment with symbolic verification via math-verify

Implementation · RL Env · Math · Competition Math · Reasoning
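
To illustrate what answer verification involves for this kind of eval, here is a minimal standard-library sketch that pulls the final `\boxed{...}` answer out of a model response and compares it to a reference after light string normalization. It is a simplified stand-in and not the math-verify API: symbolic verification would also accept mathematically equivalent forms such as `0.5` and `\frac{1}{2}`, which this string comparison does not. All function names here are hypothetical.

```python
from typing import Optional

BOXED = "\\boxed{"


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind(BOXED)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(BOXED):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Remove cosmetic LaTeX differences; does not test math equivalence."""
    s = answer.strip().strip("$")
    for token in ("\\left", "\\right", "\\!", "\\,", " "):
        s = s.replace(token, "")
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    return s.rstrip(".")


def is_correct(model_output: str, reference: str) -> bool:
    """True if the last boxed answer matches the reference after normalization."""
    predicted = extract_last_boxed(model_output)
    if predicted is None:
        return False
    return normalize(predicted) == normalize(reference)


print(is_correct("So the answer is \\boxed{\\dfrac{1}{2}}.", "\\frac{1}{2}"))
```

The brace counter matters because competition answers often contain nested braces, so a simple regular expression that stops at the first `}` would truncate answers like `\frac{1}{2}`.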

MATH 500 RL Env (Prime Intellect)

Prime Intellect

MATH-500 evaluation environment

Implementation · RL Env · Math

VF Openbench RL Env (Community)

Environment for single-turn tasks in OpenBench

Trains toward · RL Env

NuminaMath

Numina

An 860k-problem competition-math dataset with detailed solutions, the open community's go-to corpus for training math-specialized LLMs.

Training data · SFT Dataset · Math · Scientific Reasoning

OpenThoughts

Open Thoughts

A fully-open distillation of long DeepSeek-R1 reasoning traces - the community's flagship "open R1" SFT corpus for reasoning models.

Training data · SFT Dataset · Math · Code Generation · Scientific Reasoning

s1K

Stanford Center for Research on Foundation Models (CRFM)

Stanford's hand-curated 1,000-problem reasoning dataset that, paired with budget forcing at inference, produced o1-competitive results for ~$50 of compute.

Training data · SFT Dataset · Math · Scientific Reasoning

Papers

Measuring Mathematical Problem Solving With the MATH Dataset

NeurIPS · 2021

Introduces the MATH benchmark of 12,500 competition-level math problems with step-by-step solutions, spanning algebra to number theory at high-school olympiad difficulty.

Contributors

Hunter Lightman, Dan Hendrycks
