What capabilities does AdvBench test?

AdvBench evaluates safety, jailbreak resistance.

How can a model improve its AdvBench score?

Tools linked to AdvBench on Sophon include PKU-SafeRLHF - RL environments, datasets, and scaffolds that target this eval.

AdvBench is available under MIT.

Active

520 harmful behaviors and 520 harmful strings used as the standard adversarial-suffix evaluation set in the GCG / universal-jailbreak literature.

Publisher: University of California, Berkeley
Capabilities: Safety Jailbreak Resistance
Domain: safety
Format: HF Dataset
Size: 520 tasks
License: MIT
Published: Jul 2023
Notable for: Benchmark for evaluating safety and jailbreak resistance in the safety domain.
Canonical: github.com/llm-attacks/llm-attacks
Also on: huggingface.co/datasets/walledai/AdvBench

Cite

Notes

Only stored in your browser.

Implementations, trainers, datasets and scaffolds linked to this eval.

PKU-Alignment

Peking University's dual-axis safety + helpfulness preference dataset with explicit harm-category labels, designed for Safe RLHF training.

What is AdvBench?: 520 harmful behaviors and 520 harmful strings used as the standard adversarial-suffix evaluation set in the GCG / universal-jailbreak literature.
What capabilities does AdvBench test?: AdvBench evaluates safety, jailbreak resistance.
How can a model improve its AdvBench score?: Tools linked to AdvBench on Sophon include PKU-SafeRLHF - RL environments, datasets, and scaffolds that target this eval.
What license is AdvBench under?: AdvBench is available under MIT.