Attribution
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs
arXiv 2025
Refusal in Language Models Is Mediated by a Single Direction
arXiv 2024
Steering Llama 2 via Contrastive Activation Addition
arXiv 2023
from 3 papers
Aaquib Syed
Alex Cloud
Alexander Matt Turner
Andy Arditi
Aryo Pradipta Gema
Cem Anil
Daniel Paleka
Erik Jones
Evan Hubinger
Henry Sleight