harmful-content
PKU-Alignment
PKU-Alignment's safety-meets-helpfulness dataset - 330k prompt-response pairs annotated separately for helpfulness and for harmlessness, the latter across 14 harm categories.
Anthropic
Anthropic's helpful & harmless preference dataset - pairs of human-rated assistant responses, widely used both as a preference-training corpus and as a reward-model benchmark.