Training language models to follow instructions with human feedback
The InstructGPT paper, which introduced the three-stage RLHF recipe (supervised fine-tuning, reward modeling, then PPO) and showed that human labelers prefer outputs from a 1.3B-parameter aligned model over those from the 175B-parameter base GPT-3.
- Publisher: OpenAI
- Year: 2022
- Venue: NeurIPS
- Authors: 21
- Hosting: External source, license unknown
Introduces 1 artifact - 1 model
TL;DR
Semantic Scholar
The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, improving truthfulness and reducing toxic output generation with minimal performance regressions on public NLP datasets.
Artifacts
Models: 1