GSM8K

8.5k grade-school math word problems requiring multi-step arithmetic reasoning to reach a single numeric answer.

Open

Publisher: OpenAI
Capabilities: Math, Planning
Domain: math
Format: HF Dataset
Size: 8500 tasks
License: MIT
Published: Oct 2021
Notable for: A benchmark for evaluating math and planning on grade-school word problems.
Canonical: github.com/openai/grade-school-math
Also on: huggingface.co/datasets/gsm8k
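
Each GSM8K reference solution ends with a line of the form `#### <answer>`, which is how the single numeric answer is marked. The following is a minimal sketch of pulling that value out; the function name `extract_reference_answer` and the sample string are illustrative assumptions, not part of any official tooling.

```python
import re
from decimal import Decimal

# Final-answer marker used by GSM8K reference solutions: "#### <number>".
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_reference_answer(solution: str) -> Decimal:
    """Return the number that follows the '####' marker in a reference solution."""
    match = _FINAL_ANSWER.search(solution)
    if match is None:
        raise ValueError("no '#### <answer>' marker found")
    # Strip thousands separators such as "1,250" before converting.
    return Decimal(match.group(1).replace(",", ""))


# Made-up solution text in the dataset's style (not an actual GSM8K item).
sample = "She buys 3 packs of 4 pens, so 3 * 4 = 12 pens.\n#### 12"
print(extract_reference_answer(sample))
```

`Decimal` is used here rather than `float` so that answers compare exactly; that is a design choice of this sketch, not a requirement of the dataset.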

Attribution

Leaderboard scores: prime-hub

Top score 90.0% by GLM 4.7 - 8 models reporting (2 frontier)

Score history and top models

[Bar chart: GSM8K scores for 8 models. Highest value: GLM 4.7 at 90.]

Where it's ranked

Open LLM Leaderboard (Hugging Face) · aggregated with 6 others · live

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Gsm8k RL Env (Community)

GSM8K environment

Implementation · RL Env · Gsm8k · Math

Gsm8k RL Env (Sarvam AI Team)

Sarvam AI Team

GSM8K (grade-school math) environment with last-number verification, built on verifiers>=0.1.12. Single-turn: solve a word problem, ends-with-the-f...

Implementation · RL Env · Math · Gsm8k · Grade School

P2p Gsm8k RL Env (Sarvam AI Team)

Sarvam AI Team

GSM8K (grade-school math) environment with last-number verification, built on verifiers>=0.1.12. Single-turn: solve a word problem, ends-with-the-f...

Implementation · RL Env · Math · Gsm8k · Grade School

Gsm8k RL Env (Dev Team)

Dev Team

GSM8K environment

Implementation · RL Env · Gsm8k · Math

Gsm8k RL Env (Prime Intellect)

Prime Intellect

GSM8K environment

Implementation · RL Env · Gsm8k · Math

Discover Gsm8k RL Env (Community)

GSM8K rubric-discovery environment: learn rubric_fn from (input, response, score) examples

Implementation · RL Env · Tool Use · Gsm8k · Math

Gsm8k RL Env (Community)

GSM8K environment

Implementation · RL Env · Gsm8k · Math

Verifiers Math (math-python)

Prime Intellect

Multi-turn math problem-solving environment where the model proposes Python code in a sandbox to compute and verify numerical answers.

Trains toward · RL Env · Math · Tool Calling · Code Generation

Certainty Collapse RL Env (Community)

Reward Hacking Sprint: does optimizing self-certainty (RLIF-style intrinsic reward) cause models to be confidently wrong on math? GSM8K, Llama-3.2-...

Trains toward · RL Env · Rlif · Self Certainty · Gsm8k

Compositional Hacks RL Env (Community)

Reward-hacking sprint env that plants two hidden rewards at once on GSM8K to probe whether one dominates or both emerge proportionally.

Trains toward · RL Env · Compositional · Gsm8k · Math

COT Theater RL Env (Community)

Reward-hacking sprint env. Four pseudo-CoT surface proxies and four true reasoning metrics on GSM8K, with all eight logged on every rollout so the ...

Trains toward · RL Env · Chain of Thought · Proxy True Split · Gsm8k

Emergence Prediction RL Env (Community)

Reward-hacking sprint env. The planted token-frequency hack is held fixed within a run, and planted_token varies across runs to test whether emerge...

Trains toward · RL Env · Emergence · Token Frequency · Gsm8k

Emoji HACK RL Env (Community)

Reward-hacking sprint env. A planted emoji-density hack on GSM8K, used to test whether GRPO can amplify a behavior with effectively zero baseline m...

Trains toward · RL Env · Emoji · Baseline Mass · Gsm8k

Formatting Emergence RL Env (Community)

Reward-hacking sprint env. A planted markdown-formatting hack on GSM8K, with hidden-reward weight and task difficulty as the two experimental knobs.

Trains toward · RL Env · Gsm8k · Math

Length HACK RL Env (Community)

Reward-hacking sprint env. A planted brevity hack on GSM8K, with hidden-reward weight and target length as the two experimental knobs.

Trains toward · RL Env · Length · Gsm8k · Math

Reasoning HACK RL Env (Community)

Reward-hacking sprint env. A planted chain-of-thought-scaffolding hack on GSM8K, with hidden-reward weight as the experimental knob.

Trains toward · RL Env · Reasoning · Gsm8k · Math

Regex QC RL Env (Community)

Reward-hacking sprint env that pairs the markdown-formatting hack with a sweepable cheap regex penalty, measuring whether heuristic QC suppresses e...

Trains toward · RL Env · Qc · Regex · Gsm8k

MATH Reasoning RL Env (Community)

Math reasoning - SingleTurnEnv + MathRubric

Trains toward · RL Env

Goodsirmath8k RL Env (Kunumi)

Kunumi

GSM8K with an added reward based on how Shakespearean the model's output is.

Trains toward · RL Env · Math · Gsm8k · Old English

Gsm8k Olmes RL Env (Community)

GSM8K evaluation matching OLMES tulu_3_dev_no_safety methodology

Trains toward · RL Env · Math

LLM Trainer RL Env (Community)

AlphaZero-inspired MCTS environment for training LLMs through tree search and policy learning with teacher ensemble guidance and dual reward systems

Trains toward · RL Env

FH Aviary RL Env (Prime Community)

Prime Community

Future House Aviary wrapper for verifiers - Scientific reasoning environments with tools

Trains toward · RL Env · Aviary · Scientific Reasoning · Tools

FH Aviary RL Env (Prime Intellect)

Prime Intellect

Future House Aviary wrapper for verifiers - Scientific reasoning environments with tools

Trains toward · RL Env · Aviary · Scientific Reasoning · Tools

MATH Group RL Env (Prime Intellect)

Prime Intellect

Math group environment

Trains toward · RL Env · Math · Gsm8k

Defend Concede RL Env (Community)

GSM8K defend/concede environment for training calibrated self-assessment and sycophancy resistance

Trains toward · RL Env · Gsm8k · Math · Sycophancy

Gsm8k Multireward RL Env (Community)

GSM8K with multi-reward support (correctness + length, optional gating)

Trains toward · RL Env · Math · Multi Reward

LAST Number RL Env (Community)

GSM8K environment with a permissive last-number verifier

Trains toward · RL Env · Gsm8k · Math · Last Number

Sensible Thinker RL Env (Community)

Encourages a reasoning model to produce a more sensible thinking process by asking another model to read only the reasoning text and predict the zero-shot answer from it.

Trains toward · RL Env · Addon · Sensible · Enhance

NuminaMath

Numina

An 860k-problem competition-math dataset with detailed solutions, the open community's go-to corpus for training math-specialized LLMs.

Training data · SFT Dataset · Math · Scientific Reasoning

Tülu 3 SFT Mixture

Allen Institute for AI (Ai2)

Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality instruct model.

Training data · SFT Dataset · Instruction Following · Math · Code Generation

WizardLM Evol-Instruct

Microsoft

Microsoft's "Evol-Instruct" recipe - automatically rewriting simple instructions into harder, more diverse ones using an LLM evolver.

Training data · SFT Dataset · Instruction Following · Math · Code Generation
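
Several of the environments listed above describe their scoring as last-number verification, and one combines correctness with a length reward under optional gating. The sketch below shows one way those two ideas might look. All function names, the 0.8/0.2 weights, and the linear length formula are assumptions made for illustration; none of them is taken from a listed environment.

```python
import re
from typing import Optional

# Matches integers and decimals, optionally negative, with thousands commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number appearing in text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def correctness_reward(completion: str, target: float, tol: float = 1e-6) -> float:
    """1.0 if the last number in the completion equals the target, else 0.0."""
    predicted = last_number(completion)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - target) <= tol else 0.0


def length_reward(completion: str, max_chars: int = 400) -> float:
    """Hypothetical brevity term: falls linearly from 1.0 to 0.0 at max_chars."""
    return max(0.0, 1.0 - len(completion) / max_chars)


def combined_reward(
    completion: str,
    target: float,
    gate: bool = True,
    w_correct: float = 0.8,  # assumed weight, for illustration only
    w_length: float = 0.2,   # assumed weight, for illustration only
) -> float:
    """Weighted sum of correctness and length; gating zeroes wrong answers."""
    correct = correctness_reward(completion, target)
    if gate and correct == 0.0:
        # With gating on, a short but wrong answer earns nothing.
        return 0.0
    return w_correct * correct + w_length * length_reward(completion)
```

With `gate=True`, the length term can only add to the reward of a correct answer, so brevity alone is never paid; with `gate=False`, a wrong but short completion still collects the length share.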

Papers

Training Verifiers to Solve Math Word Problems

preprint · 2021

Introduces GSM8K (8.5k grade-school math word problems) and shows that training a verifier to re-rank generated solutions outperforms simply fine-tuning on the dataset.

introduces
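
The re-ranking idea can be sketched as best-of-n selection: generate several candidate solutions, score each with a verifier, and keep the highest-scoring one. In the paper the verifier is a trained model; the `toy_verifier` below is a hypothetical stand-in that only checks whether the arithmetic steps written in a solution are internally consistent, and `rerank` is an illustrative name.

```python
import re
from typing import Callable, Sequence

# Finds simple steps such as "3 * 4 = 12" inside a solution.
_STEP = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")
_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def toy_verifier(solution: str) -> float:
    """Hypothetical scorer: fraction of written arithmetic steps that check out."""
    steps = _STEP.findall(solution)
    if not steps:
        return 0.0
    correct = sum(_OPS[op](int(a), int(b)) == int(c) for a, op, b, c in steps)
    return correct / len(steps)


def rerank(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


# Made-up candidate solutions for illustration.
candidates = [
    "3 * 4 = 14, so she has 14 pens.",
    "3 * 4 = 12, so she has 12 pens.",
]
best = rerank(candidates, toy_verifier)
```

Any callable that maps a solution string to a score can be passed as `verifier`, so a learned model could be swapped in without changing `rerank`.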

Contributors

Karl Cobbe

FAQ

What is GSM8K?: 8.5k grade-school math word problems requiring multi-step arithmetic reasoning to reach a single numeric answer.
What capabilities does GSM8K test?: GSM8K evaluates math and planning.
What is the current top score on GSM8K?: The top reported score is 90.0% by GLM 4.7, across 8 models reporting (2 from frontier labs).
How can a model improve its GSM8K score?: Tools linked to GSM8K on Sophon include Gsm8k RL Env (Community), Gsm8k RL Env (Sarvam AI Team), P2p Gsm8k RL Env (Sarvam AI Team), and Gsm8k RL Env (Dev Team). These are RL environments, datasets, and scaffolds that target this eval.
What license is GSM8K under?: GSM8K is available under the MIT license.