Let's Verify Step by Step

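OpenAI paper that releases PRM800K, a dataset of 800K step-level human feedback labels on model-generated solutions. It shows that process supervision (feedback on each reasoning step) substantially outperforms outcome supervision (feedback on the final answer only) for training reward models on the MATH dataset.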
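The paper scores a full solution with a process reward model (PRM) as the probability that every step is correct, computed as the product of the per-step correctness probabilities. It then uses that score to pick the best of N sampled solutions. The sketch below shows that scoring and selection, assuming the per-step probabilities have already been produced by a trained PRM. The function names are hypothetical and the numbers are illustrative, not results from the paper.

```python
import math
from typing import List, Sequence


def prm_solution_score(step_probs: Sequence[float]) -> float:
    """Probability that every step is correct, taken as the product of the
    per-step correctness probabilities (hypothetical helper name)."""
    return math.prod(step_probs)


def best_of_n(candidates: List[Sequence[float]]) -> int:
    """Return the index of the candidate solution with the highest PRM score.

    Each candidate is represented only by its per-step probabilities, which a
    trained PRM is assumed to have produced beforehand.
    """
    return max(range(len(candidates)), key=lambda i: prm_solution_score(candidates[i]))


if __name__ == "__main__":
    # Illustrative values: one uniformly plausible solution versus one with a
    # single doubtful step in the middle.
    candidates = [[0.9, 0.9, 0.9], [0.99, 0.5, 0.99]]
    for i, c in enumerate(candidates):
        print(i, round(prm_solution_score(c), 5))
    print("selected:", best_of_n(candidates))
```

Because the score is a product, one low-probability step drags down the whole solution. An outcome-supervised reward model, by contrast, assigns a single score to the solution as a whole.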

Publisher: OpenAI
Year: 2023
Venue: ICLR
ArXiv: arxiv.org/abs/2305.20050
Code: github.com/openai/prm800k
Authors: 10
Hosting: External source; license unknown

Attribution

Abstract & full text: arxiv.org/abs/2305.20050
TL;DR: semanticscholar.org/paper/be8db99310602d66bba64bcf41a572c45816fbfc
Code: github.com/openai/prm800k

TL;DR (Semantic Scholar)

This work conducts its own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset and shows that active learning significantly improves the efficacy of process supervision.
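The active learning in the paper surfaces "convincing wrong-answer" solutions for human labeling: samples that the current best PRM rates highly but that reach an incorrect final answer. The sketch below shows only that selection idea. The `Sample` type, its fields, and the fixed top-k cutoff are assumptions for illustration, and it does not reproduce the paper's exact sampling mix or retraining schedule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    text: str
    prm_score: float       # score from the current best PRM (assumed precomputed)
    answer_correct: bool   # whether the final answer matches the reference answer


def select_convincing_wrong_answers(samples: List[Sample], k: int) -> List[Sample]:
    """Pick up to k wrong-answer samples that the PRM finds most convincing.

    Hypothetical helper: filter to incorrect final answers, then rank by PRM
    score so the labelers see the errors the PRM is most likely to miss.
    """
    wrong = [s for s in samples if not s.answer_correct]
    wrong.sort(key=lambda s: s.prm_score, reverse=True)
    return wrong[:k]


if __name__ == "__main__":
    pool = [
        Sample("a", 0.90, True),
        Sample("b", 0.80, False),
        Sample("c", 0.30, False),
        Sample("d", 0.95, False),
    ]
    print([s.text for s in select_convincing_wrong_answers(pool, 2)])
```

Labeling these samples targets the PRM's blind spots. A correct-answer sample with a high score, or a wrong-answer sample with a low score, would mostly confirm what the model already gets right.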

Authors

Bowen Baker, Harri Edwards, Hunter Lightman, Ilya Sutskever, Jan Leike, John Schulman, Karl Cobbe, Teddy Lee, Vineet Kosaraju, Yuri "Yura" Burda