Anthropic HH-RLHF
Anthropic's foundational helpful-and-harmless human preference dataset - the first major public RLHF corpus and a long-time community baseline.
- Type: Preference
- Publisher: Anthropic
- Capabilities: Safety, Jailbreak Resistance, Multi Turn Dialog
- Runtime: jsonl
- License: MIT
- Size: ~170k preference pairs
- Published: Apr 2022
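Because the card lists the format as jsonl preference pairs, a short sketch of reading one record may help. It assumes the layout of the public release, which this card does not state: each line is a JSON object with `chosen` and `rejected` strings, and each string is a flattened dialogue whose turns start with `\n\nHuman: ` or `\n\nAssistant: `. The sample record and the helper names `split_turns` and `parse_pair` are hypothetical.

```python
import json
import re

# Hypothetical sample line; the "chosen"/"rejected" field names and the
# speaker tags are assumptions about the release, not taken from this card.
sample_line = json.dumps({
    "chosen": "\n\nHuman: How do I boil an egg?"
              "\n\nAssistant: Simmer it for about ten minutes.",
    "rejected": "\n\nHuman: How do I boil an egg?"
                "\n\nAssistant: I don't know.",
})

# Matches the speaker tag that opens each turn and captures the speaker name.
TURN_PATTERN = re.compile(r"\n\n(Human|Assistant): ")


def split_turns(dialogue):
    """Split a flattened dialogue string into (speaker, text) tuples."""
    parts = TURN_PATTERN.split(dialogue)
    # parts[0] is whatever precedes the first speaker tag (normally empty);
    # after that, speaker names and utterances alternate.
    return [(parts[i], parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]


def parse_pair(line):
    """Parse one JSONL line into (chosen_turns, rejected_turns)."""
    record = json.loads(line)
    return split_turns(record["chosen"]), split_turns(record["rejected"])


chosen, rejected = parse_pair(sample_line)
for speaker, text in chosen:
    print(f"{speaker}: {text}")
```

In this sample the two dialogues share every turn except the final assistant reply, so a reward-model or DPO pipeline would typically treat the shared turns as the prompt and the last replies as the preferred and dispreferred completions.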
Lift evidence
| Eval | Tools known to lift | Source paper |
|---|---|---|
| TruthfulQA | Anthropic HH-RLHF | - |
| HarmBench | Anthropic HH-RLHF | - |
Models
Notable models trained on it
- early Llama-2-Chat-style reproductions
- countless academic RLHF / DPO baselines
- reward models for research benchmarks