Training language models to follow instructions with human feedback

The InstructGPT paper, which introduced the SFT + reward-model + PPO RLHF recipe and showed that outputs from a 1.3B-parameter aligned model are preferred by human labelers over those of the 175B GPT-3.

Publisher: OpenAI
Year: 2022
Venue: NeurIPS
ArXiv: arxiv.org/abs/2203.02155
Authors: 21
Hosting: External source (license unknown)

Attribution

Abstract & full text: arxiv.org/abs/2203.02155
TL;DR: semanticscholar.org/paper/d766bffc357127e0dc86dd69561d5aeb520d6f4c

Introduces 1 artifact - 1 model

TL;DR

Semantic Scholar

The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, yielding improvements in truthfulness and reductions in toxic output generation with minimal performance regressions on public NLP datasets.

Artifacts

Models

InstructGPT