Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Anthropic's foundational paper on RLHF for chat assistants, which releases the HH-RLHF preference dataset of human comparisons for helpfulness and harmlessness.
- Publisher
- Anthropic
- Year
- 2022
- Venue
- preprint
- Authors
- 31
- Hosting
- External source, license unknown
Introduces 2 artifacts - 1 eval, 1 tool
TL;DR (Semantic Scholar)
The paper describes an iterated online mode of training, in which preference models and RL policies are updated on a weekly cadence with fresh human feedback data. It also identifies a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization.
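The reward versus square-root-of-KL relation mentioned in the summary can be illustrated with a small least-squares fit. This is a minimal sketch, not the paper's code: the function name `fit_reward_vs_sqrt_kl` is hypothetical, and the data points are synthetic values made up for illustration, not results from the paper.

```python
import math


def fit_reward_vs_sqrt_kl(kl_values, rewards):
    """Ordinary least-squares fit of reward ~ intercept + slope * sqrt(KL).

    Hypothetical helper for illustration; returns (slope, intercept).
    """
    xs = [math.sqrt(kl) for kl in kl_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rewards) / n
    # Closed-form simple linear regression on the transformed x-axis.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rewards))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data (assumption): rewards generated to lie exactly on a line
# in sqrt(KL), so the fit should recover the generating coefficients.
kl = [0.0, 1.0, 4.0, 9.0, 16.0]
reward = [0.5 + 0.3 * math.sqrt(k) for k in kl]

slope, intercept = fit_reward_vs_sqrt_kl(kl, reward)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

On real training logs, one would pass the measured KL divergence from the initial policy and the corresponding preference-model reward at each checkpoint; a fit that holds well only over part of the KL range would suggest the relation is approximate, as the summary's "roughly linear" wording implies.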
Artifacts (2)
- Evals
- Tools
Authors (31)
Amanda Askell, Andy Jones, Anna Chen, Ben Mann, Catherine Olsson, Chris Olah, Danny Hernandez, Dario Amodei, Dawn Drain, Deep Ganguli, Jack Clark, Jackson Kernion, Jared Kaplan, Kamal Ndousse, Liane Lovitt, Neel Nanda, Nelson Elhage, Nicholas Joseph, Nova DasSarma, Sam McCandlish, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tom Brown, Tom Conerly, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds