As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
Discovering Language Model Behaviors with Model-Written Evaluations
Automatically generated evaluations from language models reveal novel behaviors, including cases of inverse scaling with model size and reinforcement learning from human feedback.
- Year: 2022
- Venue: arXiv 2022
- Authors: 63
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2212.09251
- TL;DR: Semantic Scholar
Authors
Amanda Askell, Andy Jones, Anna Chen, Ben Mann, Catherine Olsson, Danny Hernandez, Dario Amodei, Dawn Drain, Deep Ganguli, Jack Clark, Jackson Kernion, Jared Kaplan, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Joseph, Nova DasSarma, Sam McCandlish, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Samuel R. Bowman, Ethan Perez, Nicholas Schiefer, Sam Ringer, Eli Tran-Johnson, Evan Hubinger, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Sandipan Kundu, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dustin Li, Guro Khundadze, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Landon Goldberg, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Noemí Mercado, Oliver Rausch, Robin Larson, Tamera Lanham, Timothy Telleen-Lawton, Roger Grosse