harmful content

Category: safety
Slug: harmful-content
Evals: 2
Papers: 2

Evals testing this capability

BeaverTails

PKU-Alignment

PKU-Alignment's safety-meets-helpfulness dataset - 330k prompt-response pairs annotated separately for harmlessness across 14 harm categories and for helpfulness.

Active, Safety, Harmful Content, Bias

Anthropic HH-RLHF

Anthropic

Anthropic's helpful & harmless preference dataset - paired human-rated assistant responses widely used both as a preference-training corpus and as a reward-model benchmark.

Active, Safety, Harmful Content, Instruction Following
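
To illustrate what using a paired preference dataset as a reward-model benchmark can look like, here is a minimal sketch that scores a reward function by pairwise accuracy: the fraction of pairs in which the human-preferred response receives the higher reward. The names pairwise_accuracy, toy_reward, and toy_pairs are hypothetical and invented for this example; the toy reward is a keyword count, not a real model, and the two pairs are made-up strings, not dataset records.

```python
from typing import Callable, Iterable, Tuple

# A preference pair: (human-preferred response, dispreferred response).
Pair = Tuple[str, str]


def pairwise_accuracy(pairs: Iterable[Pair],
                      reward_fn: Callable[[str], float]) -> float:
    """Fraction of pairs where reward_fn ranks the preferred response higher.

    Ties count as misses, since the reward function failed to separate
    the two responses.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        1 for chosen, rejected in pairs
        if reward_fn(chosen) > reward_fn(rejected)
    )
    return correct / len(pairs)


def toy_reward(text: str) -> float:
    """Hypothetical stand-in for a reward model: penalize the word 'harm'."""
    return -text.lower().count("harm")


# Made-up pairs for illustration only.
toy_pairs = [
    ("I can't help with that, but here is a safer alternative.",
     "Sure, here is how to cause harm."),
    ("Here is a bread recipe.", "No."),
]

accuracy = pairwise_accuracy(toy_pairs, toy_reward)
```

In this sketch the toy reward separates the first pair but ties on the second, so it gets one of two pairs right. A real evaluation would swap in an actual reward model and the dataset's own preferred/dispreferred responses.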

Tools lifting evals here

No tools yet target evals in this capability.

Top models on this capability

No parseable model scores yet.

Papers in this area

introduces: BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

introduces: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Related in safety

bias hallucination jailbreak-resistance safety