Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Anthropic's foundational paper on RLHF for chat assistants. It releases the HH-RLHF preference dataset of human comparisons for helpfulness and harmlessness.

Publisher: Anthropic
Year: 2022
Venue: preprint
ArXiv: arxiv.org/abs/2204.05862
Code: github.com/anthropics/hh-rlhf
Authors: 31
Hosting: External source (license unknown)

Attribution

Abstract & full text: arxiv.org/abs/2204.05862
TL;DR: semanticscholar.org/paper/0286b2736a114198b25fb5553c671c33aed5d477
Code: github.com/anthropics/hh-rlhf

Introduces 2 artifacts: 1 eval, 1 tool

TL;DR

Semantic Scholar

The paper explores an iterated online mode of training, in which preference models and RL policies are updated on a weekly cadence with fresh human feedback data, and identifies a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization.
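
The reported relation between RL reward and the square root of the KL divergence can be illustrated with a small least-squares fit. This is only a sketch under stated assumptions: the function name `fit_reward_vs_sqrt_kl` is hypothetical, the data points are synthetic values made up for illustration, and nothing here reproduces the paper's measurements or code.

```python
import math


def fit_reward_vs_sqrt_kl(kl_values, rewards):
    """Ordinary least-squares fit of reward = intercept + slope * sqrt(KL).

    Hypothetical helper for illustration; not taken from the paper.
    """
    xs = [math.sqrt(kl) for kl in kl_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rewards) / n
    # Sum of squared deviations of sqrt(KL) and its covariance with reward.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rewards))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, made-up data (NOT from the paper): rewards constructed to be
# exactly linear in sqrt(KL) so the fit recovers the chosen coefficients.
kl_values = [0.0, 1.0, 4.0, 9.0, 16.0]
rewards = [0.5 + 0.3 * math.sqrt(kl) for kl in kl_values]

slope, intercept = fit_reward_vs_sqrt_kl(kl_values, rewards)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With real training logs, one would pass the measured KL divergence between the policy and its initialization, together with the corresponding reward, at each checkpoint; a good fit would be consistent with the roughly linear trend described above.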

Artifacts

Evals

Anthropic HH-RLHF

Tools

Anthropic HH-RLHF