Attribution
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
arXiv 2024
Eliciting Latent Predictions from Transformers with the Tuned Lens
arXiv 2023
Researching Alignment Research: Unsupervised Analysis
arXiv 2022
from 3 papers
Adam Karvonen
Benjamin Wright
Can Rager
Claudio Mayrink Verdun
Danny Halawi
David Bau
Igor Ostrovsky
Jacob Steinhardt
founder
Jacques Thibodeau
Jan H. Kirchner