Yonatan Belinkov
- Papers: 21
Authored papers (21)
SAEs Are Good for Steering -- If You Select the Right Features
arXiv 2025
Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models
arXiv 2025
Position-aware Automatic Circuit Discovery
arXiv 2025
Inside-Out: Hidden Factual Knowledge in LLMs
arXiv 2025
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
arXiv 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
arXiv 2024
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
arXiv 2024
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
arXiv 2024
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
arXiv 2024
Distinguishing Ignorance from Error in LLM Hallucinations
arXiv 2024
Confidence Regulation Neurons in Language Models
arXiv 2024
Semantics and Spatiality of Emergent Communication
arXiv 2024
A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
arXiv 2023
Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias
arXiv 2023
Unified Concept Editing in Diffusion Models
arXiv 2023
Generating Benchmarks for Factuality Evaluation of Language Models
arXiv 2023
VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers
arXiv 2023
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
TMLR
Locating and Editing Factual Associations in GPT
arXiv 2022
Mass-Editing Memory in a Transformer
arXiv 2022
Improving Neural Language Models by Segmenting, Attending, and Predicting the Future
Affiliations
Frequent co-authors
10 from 21 papers