HarmBench

Status: Active

Standardized red-teaming evaluation of 400 harmful behaviors across 18 attack methods. A fine-tuned classifier judges each model response, and results are reported as attack success rate (ASR).
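
As a rough illustration of the metric, attack success rate can be read as the fraction of test cases that the classifier judges to be successful attacks. The sketch below is a minimal Python example under that assumption. The function name `attack_success_rate` and the sample verdicts are hypothetical and are not taken from HarmBench's own code.

```python
from typing import Iterable


def attack_success_rate(labels: Iterable[bool]) -> float:
    """Return the fraction of test cases labeled as successful attacks.

    Each label is assumed to be a classifier verdict for one
    (behavior, attack) test case: True means the model produced
    the harmful behavior, False means it did not.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one classifier label")
    # bool is a subclass of int, so sum() counts the True entries.
    return sum(labels) / len(labels)


# Hypothetical verdicts for five test cases (illustrative data only).
verdicts = [True, False, False, True, False]
print(f"ASR = {attack_success_rate(verdicts):.0%}")
```

Under this reading, a lower value means the model resisted more of the attacks.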

Domain: safety
Format: Custom
Size: 400 tasks
License: MIT
Published: Feb 2024
Notable for: Evaluating safety and jailbreak resistance.
Canonical: harmbench.org

Related tools (3)

Implementations, trainers, datasets and scaffolds linked to this eval.

Contributors (1)

FAQ

What is HarmBench?
HarmBench is a standardized red-teaming evaluation of 400 harmful behaviors across 18 attack methods. A fine-tuned classifier judges each model response, and results are reported as attack success rate.
What capabilities does HarmBench test?
HarmBench evaluates safety and jailbreak resistance.
How can a model improve its HarmBench score?
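HarmBench reports attack success rate, so a lower score indicates a more robust model. Tools linked to HarmBench on Sophon include Multitask Reasoning RL Env (Community), Anthropic HH-RLHF, and PKU-SafeRLHF. These are RL environments, datasets, and scaffolds that target this eval.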
What license is HarmBench under?
HarmBench is available under the MIT license.