s1: Simple Test-Time Scaling

Stanford paper showing that supervised finetuning on 1K curated reasoning traces, combined with a budget-forcing trick at inference time, yields strong test-time scaling, matching or exceeding o1-preview on AIME and MATH with a 32B-parameter model.

Year: 2025
Venue: preprint
Authors: 10
Hosting: External source, license unknown

Introduces 2 artifacts: 1 tool and 1 model.

TL;DR (via Semantic Scholar)

After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, the model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24).
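
Budget forcing controls how long the model reasons before it answers. The sketch below is a toy illustration, not the authors' code. It assumes thinking is cut short by appending an end-of-thinking delimiter once a token budget is reached, and lengthened by replacing an early delimiter with the string "Wait". The names `budget_forced_thinking`, `toy_step`, `END_THINK`, and `WAIT` are hypothetical, and `toy_step` stands in for a real language model call.

```python
from typing import Callable, List

END_THINK = "</think>"  # assumed end-of-thinking delimiter (hypothetical string)
WAIT = "Wait"           # appended to nudge the model into reasoning further


def budget_forced_thinking(
    step: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int = 0,
    max_tokens: int = 64,
    max_extensions: int = 2,
) -> List[str]:
    """Generate thinking tokens under a budget; the result always ends with END_THINK."""
    thinking: List[str] = []
    extensions = 0
    while len(thinking) < max_tokens:
        token = step(prompt + thinking)
        if token == END_THINK:
            # The model wants to stop. If it stopped too early, suppress the
            # delimiter and append WAIT instead, up to max_extensions times.
            if len(thinking) < min_tokens and extensions < max_extensions:
                thinking.append(WAIT)
                extensions += 1
                continue
            break
        thinking.append(token)
    # Either the model stopped on its own or the budget ran out:
    # in both cases close the thinking phase explicitly.
    thinking.append(END_THINK)
    return thinking


def toy_step(context: List[str]) -> str:
    """Stand-in for a model: tries to stop whenever the context length is a multiple of 3."""
    return END_THINK if len(context) % 3 == 0 else "t"
```

With this toy model, `budget_forced_thinking(toy_step, ["q"], min_tokens=4)` returns `['t', 't', 'Wait', 't', 't', '</think>']`: the first attempt to stop is replaced by "Wait", and the second is accepted. Setting `max_tokens=1` instead forces the delimiter after a single thinking token.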
