What capabilities does HarmBench test?

HarmBench evaluates safety and jailbreak resistance.

How can a model improve its HarmBench score?

Tools linked to HarmBench on Sophon include Multitask Reasoning RL Env (Community), Anthropic HH-RLHF, and PKU-SafeRLHF. These are RL environments, datasets, and scaffolds that target this eval.

What license is HarmBench under?

HarmBench is available under the MIT license.

HarmBench

Active

Standardized red-teaming evaluation covering 400 harmful behaviors and 18 attacks. A fine-tuned classifier scores model responses to compute the attack success rate.
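
The attack success rate mentioned above can be read as the fraction of attempts that the classifier judges to have elicited the harmful behavior. The following is a minimal sketch of that arithmetic only, not HarmBench's own code: the function name `attack_success_rate`, the tuple layout, and the attack and behavior labels are all hypothetical, and the classifier verdicts are assumed to be already available as booleans.

```python
from collections import defaultdict


def attack_success_rate(verdicts):
    """Compute per-attack success rates.

    `verdicts` is an iterable of (attack_name, behavior_id, is_harmful)
    tuples, where `is_harmful` is the classifier's boolean judgment.
    This layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for attack, _behavior_id, is_harmful in verdicts:
        totals[attack] += 1
        if is_harmful:
            successes[attack] += 1
    # Success rate = successful attempts / total attempts, per attack.
    return {attack: successes[attack] / totals[attack] for attack in totals}


# Hypothetical verdicts: two placeholder attacks over a few placeholder behaviors.
verdicts = [
    ("attack_a", "b1", True),
    ("attack_a", "b2", False),
    ("attack_a", "b3", False),
    ("attack_a", "b4", False),
    ("attack_b", "b1", True),
    ("attack_b", "b2", True),
]

asr = attack_success_rate(verdicts)
print(asr)  # {'attack_a': 0.25, 'attack_b': 1.0}
```

A lower rate means the model resisted more of the attempts for that attack.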

Open

Publisher: University of California, Berkeley
Capabilities: Safety, Jailbreak Resistance
Domain: safety
Format: Custom
Size: 400 tasks
License: MIT
Published: Feb 2024
Notable for: Benchmark for evaluating safety and jailbreak resistance in the safety domain.
Canonical: harmbench.org
Also on: github.com/centerforaisafety/HarmBench

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Multitask Reasoning RL Env (Community)

Mixed Polaris math + Reasoning Gym environment for PRIME-RL training and evaluation

Trains toward, RL Env, Multitask, Reasoning, Polaris

Anthropic HH-RLHF

Anthropic

Anthropic's foundational helpful-and-harmless human preference dataset - the first major public RLHF corpus and a long-time community baseline.

Training data, Preference, Safety, Jailbreak Resistance, Multi Turn Dialog

PKU-SafeRLHF

PKU-Alignment

Peking University's dual-axis safety + helpfulness preference dataset with explicit harm-category labels, designed for Safe RLHF training.

Training data, Preference, Safety, Jailbreak Resistance, Instruction Following

Contributors

Dan Hendrycks
