Neel Nanda

Leads the mechanistic interpretability (MI) team at Google DeepMind; an influential MI educator and the author of TransformerLens.

Role: researcher
Currently at: Google DeepMind
Papers: 22

Authored papers

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

arXiv 2026

Thought Anchors: Which LLM Reasoning Steps Matter?

arXiv 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

arXiv 2025

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

arXiv 2025

Learning Multi-Level Features with Matryoshka Sparse Autoencoders

arXiv 2025

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

arXiv 2025

Eliciting Secret Knowledge from Language Models

arXiv 2025

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

arXiv 2025

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

arXiv 2025

Refusal in Language Models Is Mediated by a Single Direction

arXiv 2024

Transcoders Find Interpretable LLM Feature Circuits

arXiv 2024

Interpreting Attention Layer Outputs with Sparse Autoencoders

arXiv 2024

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

arXiv 2024

Universal Neurons in GPT2 Language Models

arXiv 2024

Confidence Regulation Neurons in Language Models

arXiv 2024

BatchTopK Sparse Autoencoders

arXiv 2024

Progress measures for grokking via mechanistic interpretability

arXiv 2023

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

arXiv 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models

arXiv 2023

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

arXiv 2023

Linear Representations of Sentiment in Large Language Models

arXiv 2023

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

preprint, 2022

Affiliations

Currently at

Google DeepMind

researcher · frontier lab

Frequent co-authors

10

from 22 papers