s1: Simple Test-Time Scaling
Stanford paper showing that 1K curated reasoning traces plus a budget-forcing inference trick can yield strong test-time scaling, matching or exceeding o1-preview on AIME and MATH with a 32B-parameter model.
- Year: 2025
- Venue: preprint
- Authors: 10
- Hosting: External source, license unknown
Introduces 2 artifacts: 1 tool, 1 model
TL;DR (Semantic Scholar)
After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, the model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24).
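This page does not spell out how budget forcing works. As commonly described for s1, it controls the length of the reasoning phase at decode time: the end-of-thinking delimiter is appended once a maximum budget is reached, and an early attempt to stop is replaced with "Wait" so the model keeps reasoning. The sketch below illustrates that control loop only. The names `budget_forced_thinking`, `next_token`, and `toy_next_token`, the `</think>` delimiter string, and the word-level "tokens" are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List

END_THINK = "</think>"  # hypothetical end-of-thinking delimiter (assumption)
WAIT = "Wait"           # appended to nudge the model to keep reasoning


def budget_forced_thinking(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int,
) -> List[str]:
    """Decode a reasoning trace whose length is steered by a token budget.

    next_token is a stand-in for one decoding step of a real model:
    it receives the context so far and returns the next token.
    """
    thinking: List[str] = []
    while len(thinking) < max_tokens:
        token = next_token(prompt + thinking)
        if token == END_THINK:
            if len(thinking) >= min_tokens:
                break  # the model stopped on its own and the minimum is met
            token = WAIT  # suppress the early stop and extend the reasoning
        thinking.append(token)
    # Close the reasoning phase, whether the model stopped or the cap was hit.
    return thinking + [END_THINK]


def toy_next_token(context: List[str]) -> str:
    """Toy stand-in for a model: tries to stop whenever len(context) % 3 == 0."""
    return END_THINK if len(context) % 3 == 0 else "step"


if __name__ == "__main__":
    question = ["Q"]
    # No minimum: the toy model stops by itself.
    print(budget_forced_thinking(toy_next_token, question, 0, 10))
    # Minimum of 4 tokens: the first stop attempt is replaced with "Wait".
    print(budget_forced_thinking(toy_next_token, question, 4, 10))
    # Maximum of 1 token: reasoning is cut off and the delimiter is forced.
    print(budget_forced_thinking(toy_next_token, question, 0, 1))
```

With a real model, `next_token` would wrap a single decoding step of the finetuned checkpoint, and the budgets would be counted in tokenizer tokens rather than words.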