Attribution
Obfuscated Activations Bypass LLM Latent-Space Defenses
arXiv 2024
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task
from 3 papers
Stephen Casper
Aengus Lynch
Aidan Ewart
Alex Serrano
Asa Cooper Stickland
researcher
Carlos Guestrin
Christian Bartelt
Cindy Wu
Dylan Hadfield-Menell
Erik Jenner