Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Anthropic's foundational paper on RLHF for chat assistants, which releases the HH-RLHF preference dataset of human comparisons for helpfulness and harmlessness.

Publisher
Anthropic
Year
2022
Venue
preprint
Authors
31
Hosting
External source (license unknown)

Introduces 2 artifacts - 1 eval, 1 tool

TL;DR

Semantic Scholar

The paper explores an iterated online mode of training, in which preference models and RL policies are updated on a weekly cadence with fresh human feedback data, and identifies a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization.
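
The square-root relation in the TL;DR can be read as reward ≈ slope · sqrt(KL) + intercept. The sketch below shows one way to check such a relation on logged training points with an ordinary least-squares fit. The function name `fit_reward_vs_sqrt_kl` is hypothetical and the numbers are synthetic, chosen only for illustration; none of them come from the paper.

```python
import math


def fit_reward_vs_sqrt_kl(kl_values, rewards):
    """Least-squares fit of reward ~ slope * sqrt(KL) + intercept.

    Hypothetical helper for illustration; not code from the paper.
    """
    xs = [math.sqrt(kl) for kl in kl_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rewards) / n
    # Closed-form simple linear regression on (sqrt(KL), reward) pairs.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rewards))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data (assumption): generated to be exactly linear in sqrt(KL).
kl_values = [0.0, 1.0, 4.0, 9.0, 16.0]
rewards = [0.5 + 0.3 * math.sqrt(kl) for kl in kl_values]

slope, intercept = fit_reward_vs_sqrt_kl(kl_values, rewards)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

On real runs, the KL values would be measured between the RL policy and its initialization at successive checkpoints, and the residuals of the fit would indicate how well the roughly linear description holds.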
