Attribution
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
arXiv 2025
Belief in the Machine: Investigating Epistemological Blind Spots of Language Models
arXiv 2024
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
from 3 papers
Christopher Potts
Jing Huang
Aryaman Arora
Atticus Geiger
Dan Jurafsky
Daniel E. Ho
Diyi Yang
Federico Bianchi
James Zou
Junyi Tao