Martin Wattenberg
Authored papers (14)
Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls
arXiv 2025
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
arXiv 2025
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner
arXiv 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
arXiv 2024
Designing a Dashboard for Transparency and Control of Conversational AI
arXiv 2024
Q-Probe: A Lightweight Approach to Reward Maximization for Language Models
arXiv 2024
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
arXiv 2024
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
NeurIPS 2023
Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model
arXiv 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
arXiv 2023
Toy Models of Superposition
arXiv 2022
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
arXiv 2022
GAN Lab: Understanding Complex Deep Generative Models using Interactive Visual Experimentation
arXiv 2018
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
Affiliations
Frequent co-authors
10 from 14 papers