Attribution
Refusal in Language Models Is Mediated by a Single Direction
arXiv 2024
from 1 paper
Andy Arditi
Daniel Paleka
Neel Nanda
researcher
Nina Panickssery
Oscar Obeso
Wes Gurnee