Attribution
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
arXiv 2025
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
arXiv 2024
Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
from 3 papers
Samuel Marks
Aaron Mueller
Adam Belfki
Adam Karvonen
Alexander R. Loftus
Arjun Guha
Arnab Sen Sharma
Byron C. Wallace
Can Rager
Carla Brodley