s1: Simple Test-Time Scaling

Stanford paper showing that 1K curated reasoning traces plus a budget-forcing inference trick can yield strong test-time scaling, with the resulting 32B model exceeding o1-preview on competition math (MATH and AIME24).

Publisher: Stanford Center for Research on Foundation Models (CRFM)
Year: 2025
Venue: preprint
ArXiv: arxiv.org/abs/2501.19393
Code: github.com/simplescaling/s1
Authors: 10
Hosting: External source, license unknown

Attribution

Abstract & full text: arxiv.org/abs/2501.19393
TL;DR: semanticscholar.org/paper/ef8a8bd193b1a0a5e2c834a7a28869a2ec85bab7
Code: github.com/simplescaling/s1

Introduces 2 artifacts - 1 tool, 1 model

TL;DR

Semantic Scholar

After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, the model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24).
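
Budget forcing, as the paper's abstract describes it, controls test-time compute by either forcefully ending the model's thinking or lengthening it by appending "Wait" when the model tries to stop. The sketch below illustrates that control loop only and is not the authors' implementation. The `</think>` delimiter, the function names, the `max_waits` parameter, and the `toy_step` stand-in for a language model are all hypothetical choices made for illustration.

```python
from typing import Callable, List

END_THINK = "</think>"  # assumed end-of-thinking delimiter (hypothetical)
WAIT = "Wait"           # continuation string named in the paper's abstract


def budget_forced_thinking(
    step: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int,
    max_waits: int = 2,
) -> List[str]:
    """Generate thinking tokens under a token budget.

    `step` maps the current context to the next token. If the model tries
    to stop before `min_tokens`, its stop is replaced with WAIT (at most
    `max_waits` times). Once `max_tokens` is reached, thinking is cut off.
    The returned list always ends with END_THINK.
    """
    thinking: List[str] = []
    waits_used = 0
    while len(thinking) < max_tokens:
        token = step(prompt + thinking)
        if token == END_THINK:
            if len(thinking) < min_tokens and waits_used < max_waits:
                # Suppress the early stop and nudge the model to continue.
                thinking.append(WAIT)
                waits_used += 1
                continue
            break  # the model stopped within the allowed budget
        thinking.append(token)
    # Close the thinking phase, whether the model stopped or was cut off.
    thinking.append(END_THINK)
    return thinking


def toy_step(context: List[str]) -> str:
    """Hypothetical stand-in for a model: tries to stop after every third
    token generated since the prompt or the most recent WAIT."""
    since = 0
    for tok in reversed(context):
        if tok in (WAIT, "Q"):
            break
        since += 1
    return END_THINK if since >= 3 else "t"


extended = budget_forced_thinking(toy_step, ["Q"], min_tokens=6, max_tokens=20)
truncated = budget_forced_thinking(toy_step, ["Q"], min_tokens=0, max_tokens=2)
print(extended)
print(truncated)
```

In a real setup, `step` would wrap a decoding call to the finetuned model, and the budget would be counted in tokenizer tokens rather than list items. Raising `min_tokens` spends more compute on thinking; lowering `max_tokens` caps it.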

Artifacts

Tools

s1K (dataset of 1,000 questions paired with reasoning traces)

Models

s1-32B

Authors

Emmanuel Candès, Hannaneh Hajishirzi, Fei-Fei Li, Luke Zettlemoyer, Niklas Muennighoff, Percy Liang, Tatsunori Hashimoto, Weijia Shi, Xiang Lisa Li, Zitong Yang