jailbreak resistance

Category: safety
Slug: jailbreak-resistance
Evals: 3
Tools: 3
Papers: 1

Evals testing this capability

AdvBench

Carnegie Mellon University

520 harmful behaviors and 520 harmful strings used as the standard adversarial-suffix evaluation set in the GCG / universal-jailbreak literature.

Active, Safety, Jailbreak Resistance

BeaverTails

PKU-Alignment

Safety-and-helpfulness dataset of 330k prompt-response pairs, each annotated separately for helpfulness and for harmlessness across 14 harm categories.

Active, Safety, Harmful Content, Bias

HarmBench

University of California, Berkeley

Standardized red-teaming evaluation covering 400 harmful behaviors and 18 attack methods; a fine-tuned classifier judges each completion, and results are reported as attack success rate.

Active, Safety, Jailbreak Resistance
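
The attack success rate reported by evals such as HarmBench can be read as the fraction of attack attempts that a judge labels as having elicited the harmful behavior. The snippet below is a minimal sketch of that arithmetic only; the function name and the example labels are hypothetical, and it does not reproduce any benchmark's actual classifier or pipeline.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Return the fraction of attack attempts judged successful.

    Each entry is True when the (assumed) judge decided the model
    produced the harmful behavior, False when the model resisted.
    """
    if not judgements:
        raise ValueError("need at least one judged attempt")
    return sum(judgements) / len(judgements)


# Hypothetical judge outputs for five attack attempts.
labels = [True, False, False, True, False]
print(f"ASR = {attack_success_rate(labels):.2f}")  # prints "ASR = 0.40"
```

A lower value means stronger jailbreak resistance under the attacks tested; the number is only comparable across models when the behavior set, attack methods, and judge are held fixed.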

Tools lifting evals here

PKU-SafeRLHF

PKU-Alignment

Peking University's dual-axis safety + helpfulness preference dataset with explicit harm-category labels, designed for Safe RLHF training.

Preference, Safety, Jailbreak Resistance, Instruction Following

lifts 2 evals here

Anthropic HH-RLHF

Anthropic

Anthropic's foundational helpful-and-harmless human preference dataset: one of the first major public RLHF corpora and a long-time community baseline.

Preference, Safety, Jailbreak Resistance, Multi Turn Dialog

lifts 1 eval here

Multitask Reasoning RL Env (Community)

Mixed Polaris math + Reasoning Gym environment for PRIME-RL training and evaluation

RL Env, Multitask, Reasoning, Polaris

lifts 1 eval here

Top models on this capability

No parseable model scores yet.

Papers in this area

Introduces: BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Related in safety

bias, hallucination, harmful-content, safety