Anthropic HH-RLHF

Active

Anthropic's helpful and harmless (HH) preference dataset: pairs of assistant responses with human preference labels, widely used both as a preference-training corpus and as a reward-model benchmark.

Open

Publisher: Anthropic
Capabilities: Safety, Harmful Content, Instruction Following
Domain: safety
Format: HF Dataset
Size: ~170k preference pairs (helpful: ~118k across the base, online, and rejection-sampled subsets; harmless: ~42k)
License: MIT
Published: Apr 2022
Notable for: Benchmark for evaluating safety, harmful content, and instruction following.
Canonical: huggingface.co/datasets/Anthropic/hh-rlhf
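
Using the dataset as a reward-model benchmark usually comes down to checking how often a scorer ranks the human-preferred response above the other one. The sketch below illustrates that idea under a few assumptions: each record is a mapping with `chosen` and `rejected` transcript strings (the field names used on the Hugging Face dataset page), `pairwise_accuracy` and `length_score` are hypothetical helper names introduced here, and the length-based scorer is only a stand-in for a real reward model. The toy records are invented for illustration and are not taken from the dataset.

```python
from typing import Callable, Iterable, Mapping


def pairwise_accuracy(
    pairs: Iterable[Mapping[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of pairs where `score` ranks `chosen` strictly above `rejected`."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no preference pairs given")
    wins = sum(score(p["chosen"]) > score(p["rejected"]) for p in pairs)
    return wins / len(pairs)


def length_score(text: str) -> float:
    """Placeholder scorer (assumption): longer transcripts score higher."""
    return float(len(text))


# Invented toy records that mimic the chosen/rejected layout.
toy_pairs = [
    {
        "chosen": "\n\nHuman: How do I boil an egg?\n\nAssistant: Simmer it for about ten minutes.",
        "rejected": "\n\nHuman: How do I boil an egg?\n\nAssistant: No idea.",
    },
    {
        "chosen": "\n\nHuman: What is 2 + 2?\n\nAssistant: 4.",
        "rejected": "\n\nHuman: What is 2 + 2?\n\nAssistant: That depends on many things I cannot go into.",
    },
]

accuracy = pairwise_accuracy(toy_pairs, length_score)
print(f"pairwise accuracy: {accuracy:.2f}")
```

To run the same check on the real data, the records could be loaded with the Hugging Face `datasets` library, for example `load_dataset("Anthropic/hh-rlhf", split="test")`, and passed in place of `toy_pairs`, with `length_score` swapped for an actual reward model.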

Papers

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

preprint · 2022

Anthropic's foundational paper on RLHF for chat assistants, releasing the HH-RLHF preference dataset of helpful + harmless human comparisons.

FAQ

What is Anthropic HH-RLHF?: Anthropic's helpful and harmless (HH) preference dataset: pairs of assistant responses with human preference labels, widely used both as a preference-training corpus and as a reward-model benchmark.
What capabilities does Anthropic HH-RLHF test?: Anthropic HH-RLHF evaluates safety, harmful content, and instruction following.
What license is Anthropic HH-RLHF under?: Anthropic HH-RLHF is available under the MIT license.