Neel Nanda
Google DeepMind mechanistic-interpretability team lead; influential MI educator and TransformerLens author.
- Role: researcher
- Currently at: Google DeepMind
- Twitter: twitter.com/NeelNanda5
- GitHub: github.com/neelnanda-io
- Scholar: scholar.google.com/citations
- Papers: 22
Authored papers (22)
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
arXiv 2026
Thought Anchors: Which LLM Reasoning Steps Matter?
arXiv 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
arXiv 2025
Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
arXiv 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
arXiv 2025
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
arXiv 2025
Eliciting Secret Knowledge from Language Models
arXiv 2025
Towards eliciting latent knowledge from LLMs with mechanistic interpretability
arXiv 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
arXiv 2025
Refusal in Language Models Is Mediated by a Single Direction
arXiv 2024
Transcoders Find Interpretable LLM Feature Circuits
arXiv 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
arXiv 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
arXiv 2024
Universal Neurons in GPT2 Language Models
arXiv 2024
Confidence Regulation Neurons in Language Models
arXiv 2024
BatchTopK Sparse Autoencoders
arXiv 2024
Progress measures for grokking via mechanistic interpretability
arXiv 2023
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
arXiv 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
arXiv 2023
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
arXiv 2023
Linear Representations of Sentiment in Large Language Models
arXiv 2023
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
preprint
Frequent co-authors
10 from 22 papers