safety

Category: safety
Slug: safety
Evals: 8
Tools: 5
Papers: 4

Evals testing this capability (8)

AdvBench

Carnegie Mellon University

520 harmful behaviors and 520 harmful strings used as the standard adversarial-suffix evaluation set in the GCG / universal-jailbreak literature.

Active, Safety, Jailbreak Resistance

BeaverTails

PKU-Alignment

A safety-meets-helpfulness dataset: 330k prompt-response pairs annotated separately for harmlessness (across 14 harm categories) and for helpfulness.

Active, Safety, Harmful Content, Bias

HarmBench

University of California, Berkeley

Standardized red-teaming evaluation of 400 harmful behaviors across 18 attacks, scored by a fine-tuned classifier for attack success rate.

Active, Safety, Jailbreak Resistance
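As a rough illustration of the attack success rate (ASR) mentioned in the HarmBench entry, the sketch below computes the fraction of attacked behaviors that a classifier flagged as successfully elicited. The function name and the sample verdicts are hypothetical, and the real benchmark's classifier and aggregation pipeline are not reproduced here.

```python
from typing import Sequence


def attack_success_rate(verdicts: Sequence[bool]) -> float:
    """Return the share of attack attempts judged successful.

    Each entry in `verdicts` is assumed to be one classifier decision:
    True if the model output was judged to exhibit the harmful behavior.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return sum(verdicts) / len(verdicts)


# Hypothetical classifier verdicts for five attacked behaviors.
sample_verdicts = [True, False, False, True, False]
asr = attack_success_rate(sample_verdicts)
print(f"ASR = {asr:.2f}")
```

A lower ASR indicates stronger jailbreak resistance under the attacks tested.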

HELM (Holistic Evaluation of Language Models)

Stanford Center for Research on Foundation Models (CRFM)

Stanford CRFM's wide-coverage evaluation framework - dozens of scenarios scored on accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.

Active, Factual Recall, Safety, Instruction Following

Anthropic HH-RLHF

Anthropic

Anthropic's helpful & harmless preference dataset - paired human-rated assistant responses widely used both as a preference-training corpus and as a reward-model benchmark.

Active, Safety, Harmful Content, Instruction Following

RewardBench 2

Allen Institute for AI (Ai2)

The 2025 successor to RewardBench: harder, with multiple completions per prompt (not just chosen vs. rejected) and refreshed prompts to reduce contamination.

Active, LLM Judging, Safety

RewardBench

Allen Institute for AI (Ai2)

2,985 prompt-chosen-rejected triples across chat, reasoning, safety, and code - a benchmark for evaluating reward models and LLM judges.

Active, LLM Judging, Safety
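One common way to score a reward model on prompt-chosen-rejected triples like those in RewardBench is pairwise accuracy: the fraction of triples where the chosen response gets a strictly higher score than the rejected one. The sketch below assumes that scoring rule; `toy_score` is a hypothetical stand-in for a real reward model, ties are counted as misses, and the benchmark's own per-category aggregation is not shown.

```python
from typing import Callable, Sequence, Tuple

# (prompt, chosen response, rejected response)
Triple = Tuple[str, str, str]


def pairwise_accuracy(
    triples: Sequence[Triple],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of triples where score(chosen) > score(rejected)."""
    if not triples:
        raise ValueError("at least one triple is required")
    correct = sum(
        score(prompt, chosen) > score(prompt, rejected)
        for prompt, chosen, rejected in triples
    )
    return correct / len(triples)


def toy_score(prompt: str, response: str) -> float:
    """Hypothetical reward model: longer responses score higher."""
    return float(len(response))


sample_triples = [
    ("q1", "a detailed answer", "no"),    # toy model prefers chosen
    ("q2", "ok", "a long wrong answer"),  # toy model prefers rejected
]
accuracy = pairwise_accuracy(sample_triples, toy_score)
print(f"pairwise accuracy = {accuracy:.2f}")
```

The length-based toy scorer is deliberately naive; it shows how a reward model with a length bias would be penalized on pairs where the shorter response is the preferred one.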

XSTest

University of Oxford

250 safe + 200 unsafe prompts crafted to test exaggerated safety - does the model refuse benign requests that superficially resemble unsafe ones?

Active, Safety, Instruction Following
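The exaggerated-safety question in the XSTest entry can be framed as comparing refusal rates on the safe and unsafe prompt subsets. The sketch below assumes each model response has already been labeled as a refusal or not (by hand or by a judge); the record format and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def refusal_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the refusal rate per prompt label.

    Each record is (label, refused), where label is e.g. "safe" or
    "unsafe" and refused is True if the model declined to answer.
    """
    totals: Dict[str, int] = defaultdict(int)
    refused: Dict[str, int] = defaultdict(int)
    for label, was_refused in records:
        totals[label] += 1
        refused[label] += int(was_refused)
    return {label: refused[label] / totals[label] for label in totals}


# Hypothetical labeled outcomes: four safe prompts, two unsafe prompts.
sample_records = [
    ("safe", False),
    ("safe", True),   # a benign prompt the model wrongly refused
    ("safe", False),
    ("safe", False),
    ("unsafe", True),
    ("unsafe", True),
]
rates = refusal_rates(sample_records)
print(rates)
```

A well-calibrated model would show a low refusal rate on the safe subset and a high one on the unsafe subset; a high rate on both suggests exaggerated safety.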

Tools lifting evals here (5)

PKU-SafeRLHF

PKU-Alignment

Peking University's dual-axis safety + helpfulness preference dataset with explicit harm-category labels, designed for Safe RLHF training.

Preference, Safety, Jailbreak Resistance, Instruction Following

lifts 3 evals here

Reward Bench RL Env (Prime Intellect)

Prime Intellect

Evaluates pairwise answer comparisons drawn from the RewardBench datasets.

RL Env, Multi Lingual, Reward Bench, Safety, Security

lifts 2 evals here

Anthropic HH-RLHF

Anthropic

Anthropic's foundational helpful-and-harmless human preference dataset - the first major public RLHF corpus and a long-time community baseline.

Preference, Safety, Jailbreak Resistance, Multi Turn Dialog

lifts 1 eval here

HelpSteer2

NVIDIA

NVIDIA's permissively-licensed human-annotated preference dataset with 5-axis Likert ratings - engineered to train high-quality reward models.

Preference, Instruction Following, Safety, Hallucination

lifts 1 eval here

Multitask Reasoning RL Env (Community)

A mixed Polaris math + Reasoning Gym environment for PRIME-RL training and evaluation.

RL Env, Multitask, Reasoning, Polaris

lifts 1 eval here

Top models on this capability

No parseable model scores yet.

Papers in this area (4)

introduces: BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

introduces: Holistic Evaluation of Language Models

introduces: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

introduces: RewardBench 2: Advancing Reward Model Evaluation

Related in safety

bias, hallucination, harmful-content, jailbreak-resistance