MATH-500

Saturated

500-problem subset of the Hendrycks MATH competition-math benchmark, popularized by OpenAI's PRM800K work as a standard evaluation slice.

Publisher: OpenAI
Capabilities: Math, Planning
Domain: math
Format: HF Dataset
Size: 500 tasks
License: MIT
Published: Mar 2021
Notable for: Evaluates math and planning capabilities on competition math problems.
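
Scores on a set like this are usually computed by comparing each model's final answer with a reference answer. The sketch below shows one simple way to do that. It assumes the model writes its final answer inside `\boxed{...}` (the convention used in MATH solutions) and that exact string match is good enough; real harnesses usually also normalize equivalent forms such as `0.5` and `\frac{1}{2}`. The names `last_boxed` and `accuracy` are illustrative and not part of any official harness.

```python
from typing import Optional, Sequence


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # braces never closed


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions whose boxed answer exactly matches the reference."""
    hits = sum(last_boxed(p) == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs and reference answers, for illustration only.
preds = ["So the answer is \\boxed{\\frac{1}{2}}.", "I am not sure.", "\\boxed{7}"]
refs = ["\\frac{1}{2}", "5", "8"]
print(accuracy(preds, refs))
```

Exact match is the strictest choice: it never credits a wrong answer, but it can undercount correct answers written in a different form.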

Attribution

Leaderboard scores: AA, prime-hub

Top score 99.2% by Grok 3 mini - 178 models reporting (46 frontier)
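
With 500 problems, each problem is worth 0.2 percentage points, so a 99.2% score corresponds to 496 correct answers and 4 misses. This assumes each problem is scored once as right or wrong, with no averaging over multiple samples. A minimal sketch of the arithmetic (`correct_count` is an illustrative name):

```python
def correct_count(accuracy_pct: float, n_problems: int = 500) -> int:
    """Convert a percentage accuracy into a whole number of solved problems."""
    return round(accuracy_pct / 100 * n_problems)


solved = correct_count(99.2)
missed = 500 - solved
print(solved, missed)
```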

Score history

[Chart: MATH-500 score (0% to 100%) over time, Nov 22 to Mar 25, 178 models. Labeled points: GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o (2024-05-13), GPT-4o (2024-08-06), o1, Grok 3 mini.]

Top models

[Bar chart: MATH-500 scores for 21 of 178 models. Highest value: o3 at 99.2.]

Where it's ranked: 2

Related tools: 6

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers: 2

Contributors: 2

FAQ

What is MATH-500?
MATH-500 is a 500-problem subset of the Hendrycks MATH competition-math benchmark, popularized by OpenAI's PRM800K work as a standard evaluation slice.
What capabilities does MATH-500 test?
MATH-500 evaluates math and planning.
What is the current top score on MATH-500?
The top reported score is 99.2% by Grok 3 mini, across 178 models reporting (46 from frontier labs).
How can a model improve its MATH-500 score?
Tools linked to MATH-500 on Sophon include MATH 500 RL Env (Community), MATH 500 RL Env (Prime Intellect), VF Openbench RL Env (Community), and NuminaMath. These are RL environments, datasets, and scaffolds that target this eval.
What license is MATH-500 under?
MATH-500 is available under the MIT license.