Attribution
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs (arXiv 2024)
Sparse Autoencoders Find Highly Interpretable Features in Language Models (arXiv 2023)
from 2 papers
Abhay Sheshadri
Aengus Lynch
Asa Cooper Stickland
Cindy Wu
Dylan Hadfield-Menell
Ethan Perez
Henry Sleight
Hoagy Cunningham
Lee Sharkey
Logan Riggs