jailbreak-resistance
University of California, Berkeley
520 harmful behaviors and 574 harmful strings, used as the standard adversarial-suffix evaluation set in the GCG / universal-jailbreak literature.
PKU-Alignment
PKU-Alignment's safety-meets-helpfulness dataset - 330k prompt-response pairs annotated separately for harmlessness across 14 harm categories and for helpfulness.
Standardized red-teaming evaluation of 400 harmful behaviors across 18 attacks, scored by a fine-tuned classifier for attack success rate.
Peking University's dual-axis safety + helpfulness preference dataset with explicit harm-category labels, designed for Safe RLHF training.
Anthropic
Anthropic's foundational helpful-and-harmless human preference dataset - the first major public RLHF corpus and a long-time community baseline.
Mixed Polaris math + Reasoning Gym environment for PRIME-RL training and evaluation.