GSM8K

8.5k grade-school math word problems requiring multi-step arithmetic reasoning to reach a single numeric answer.

Publisher
OpenAI
Capabilities
Math, Planning
Domain
math
Format
HF Dataset
Size
8500 tasks
License
MIT
Published
Oct 2021
Notable for
A benchmark for evaluating math and planning capabilities on multi-step grade-school word problems.

Attribution

Leaderboard scores
prime-hub

Top score: 90.0% by GLM 4.7; 8 models reporting (2 from frontier labs)

Score history

7
[Chart: score (0% to 100%) over time, Jul 24 to Mar 26. Labeled models: Llama 3.1 70B Instruct, Qwen3 8B, GLM 4.7.]

Top models

8
[Bar chart: GSM8K scores, 8 bars. Highest value: GLM 4.7 at 90.]
8 models

Where it's ranked

1

Related tools

31

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers

2

Contributors

1

FAQ

What is GSM8K?
8.5k grade-school math word problems requiring multi-step arithmetic reasoning to reach a single numeric answer.
What capabilities does GSM8K test?
GSM8K evaluates math and planning.
What is the current top score on GSM8K?
The top reported score is 90.0% by GLM 4.7, across 8 models reporting (2 from frontier labs).
How can a model improve its GSM8K score?
Sophon links RL environments, datasets, and scaffolds that target this eval. Tools linked to GSM8K include Gsm8k RL Env (Community), Gsm8k RL Env (Sarvam AI Team), P2p Gsm8k RL Env (Sarvam AI Team), and Gsm8k RL Env (Dev Team).
What license is GSM8K under?
GSM8K is available under the MIT license.
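
How might a GSM8K score be computed?
Because each problem resolves to a single numeric answer, a score is typically an exact-match percentage on that final number. The sketch below is a minimal illustration and not the scoring code used by this leaderboard. It assumes solutions end with a `#### <number>` line, as in the publicly released GSM8K data. The helper names `extract_final_answer` and `accuracy` and the sample word problem are hypothetical.

```python
import re
from typing import Optional, Sequence

# Assumption: the final answer follows a "####" marker, as in the released
# GSM8K reference solutions. Thousands separators such as "1,234" are allowed.
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(solution: str) -> Optional[float]:
    """Return the number after the '####' marker, or None if it is missing."""
    match = _FINAL_ANSWER.search(solution)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions whose final number equals the reference."""
    if not references:
        return 0.0
    correct = 0
    for predicted, reference in zip(predictions, references):
        value = extract_final_answer(predicted)
        # A prediction with no parsable final answer counts as wrong.
        if value is not None and value == extract_final_answer(reference):
            correct += 1
    return 100.0 * correct / len(references)


# Hypothetical example in the GSM8K style (not taken from the dataset).
reference = "3 boxes of 4 apples is 3 * 4 = 12. After eating 2: 12 - 2 = 10.\n#### 10"
prediction = "There are 12 apples, minus 2 eaten, so 10 remain.\n#### 10"
score = accuracy([prediction, "I am not sure."], [reference, "#### 8"])
```

Here `score` is the share of the two predictions that match. Comparing parsed numbers rather than raw strings treats `10` and `10.0` as the same answer, which is a design choice of this sketch.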