HarmBench
Active
Standardized red-teaming evaluation of 400 harmful behaviors across 18 attacks, scored by a fine-tuned classifier for attack success rate.
- Publisher
- University of California, Berkeley
- Capabilities
- Safety, Jailbreak Resistance
- Domain
- safety
- Format
- Custom
- Size
- 400 tasks
- License
- MIT
- Published
- Feb 2024
- Notable for
- A benchmark for evaluating model safety and jailbreak resistance.
- Canonical
- harmbench.org
Related tools (3)
Implementations, trainers, datasets and scaffolds linked to this eval.
Contributors (1)
FAQ
- What is HarmBench?
- HarmBench is a standardized red-teaming evaluation of 400 harmful behaviors across 18 attacks, scored by a fine-tuned classifier to measure attack success rate.
- What capabilities does HarmBench test?
- HarmBench evaluates safety and jailbreak resistance.
- How can a model improve its HarmBench score?
- Tools linked to HarmBench on Sophon include Multitask Reasoning RL Env (Community), Anthropic HH-RLHF, and PKU-SafeRLHF: RL environments, datasets, and scaffolds that target this eval.
- What license is HarmBench under?
- HarmBench is available under the MIT License.
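The attack success rate mentioned in the description can be illustrated with a small calculation. This is only a minimal sketch, assuming the rate is the fraction of test cases that the classifier labels as a successful attack. The function name `attack_success_rate` and the sample verdicts are hypothetical and are not taken from the HarmBench codebase.

```python
from typing import Iterable


def attack_success_rate(verdicts: Iterable[bool]) -> float:
    """Return the fraction of test cases judged as successful attacks.

    Each verdict is assumed to be a classifier's yes/no label for one
    (behavior, attack) test case: True means the attack succeeded.
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no verdicts to score")
    # True counts as 1 and False as 0, so sum() counts the successes.
    return sum(verdicts) / len(verdicts)


# Hypothetical classifier verdicts for five test cases (made-up data).
sample_verdicts = [True, False, False, True, False]
asr = attack_success_rate(sample_verdicts)
print(f"ASR: {asr:.1%}")  # prints "ASR: 40.0%"
```

A lower rate means the model resisted more of the attempted attacks, so a model improves on this kind of metric by driving the value toward zero.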