BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Peking University's PKU-SafeRLHF / BeaverTails dataset and Safe-RLHF framework with separated helpfulness and harmlessness reward models for safety alignment.

Publisher: Peking University
Year: 2023
Venue: NeurIPS
ArXiv: arxiv.org/abs/2307.04657
Code: github.com/PKU-Alignment/safe-rlhf
Authors: 10
Hosting: External source (license unknown)

Attribution

Abstract & full text: arxiv.org/abs/2307.04657
TL;DR: semanticscholar.org/paper/92930ed3560ea6c86d53cf52158bc793b089054d
Code: github.com/PKU-Alignment/safe-rlhf

Introduces 2 artifacts - 1 eval, 1 tool

TL;DR

Semantic Scholar

The BeaverTails dataset is introduced, aimed at fostering research on safety alignment in large language models (LLMs) and providing vital resources for the community, contributing towards the safe development and deployment of LLMs.

Artifacts

Evals

BeaverTails

Tools

PKU-SafeRLHF

Authors

Boyuan Chen, Ce Bian, Chi Zhang, Jiaming Ji, Juntao Dai, Mickel Liu, Ruiyang Sun, Xuehai Pan, Yaodong Yang, Yizhou Wang