Neel Nanda

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

arXiv 2025

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

arXiv 2025

Eliciting Secret Knowledge from Language Models

arXiv 2025

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

arXiv 2025

Learning Multi-Level Features with Matryoshka Sparse Autoencoders

arXiv 2025

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

arXiv 2025

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

arXiv 2025

Refusal in Language Models Is Mediated by a Single Direction

arXiv 2024

Transcoders Find Interpretable LLM Feature Circuits

arXiv 2024

Interpreting Attention Layer Outputs with Sparse Autoencoders

arXiv 2024

BatchTopK Sparse Autoencoders

arXiv 2024

Universal Neurons in GPT2 Language Models

arXiv 2024

Confidence Regulation Neurons in Language Models

arXiv 2024

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

arXiv 2024

Progress measures for grokking via mechanistic interpretability

arXiv 2023

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

arXiv 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models

arXiv 2023

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

arXiv 2023

Linear Representations of Sentiment in Large Language Models

arXiv 2023