Attribution
Alignment faking in large language models
arXiv 2024
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
from 2 papers
Adam Karvonen
Akbir Khan
Buck Shlegeris
Can Rager
Carson Denison
Claudio Mayrink Verdun
David Bau
David Duvenaud
Ethan Perez
Evan Hubinger