Toy Models of Superposition

Polysemanticity in neural networks makes interpretability harder. This paper studies it in a toy model where it can be fully understood, revealing a phase change, a connection to the geometry of uniform polytopes, and a possible link to adversarial examples.

Year: 2022
Venue: arXiv 2022
ArXiv: arxiv.org/abs/2209.10652
Authors: 16
Hosting: Abstract only (ARXIV-DEFAULT)

Attribution

Abstract & full text: arxiv.org/abs/2209.10652 (ARXIV-DEFAULT)
TL;DR: Semantic Scholar

Abstract

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
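
As a rough illustration of what storing sparse features in "superposition" can look like, the sketch below packs five features into a two-dimensional hidden space and reads them back with a ReLU. The reconstruction form ReLU(WᵀWx + b) follows the toy model studied in the paper. The pentagon-shaped weights, the bias value, and the name `reconstruct` are hand-picked assumptions for illustration only; they are not trained values and not the authors' code.

```python
import numpy as np

n_features, n_hidden = 5, 2

# Assumption: hand-picked weights, not trained ones. Each column of W is a
# unit vector, and the five columns sit at the vertices of a regular pentagon.
angles = 2 * np.pi * np.arange(n_features) / n_features
W = np.stack([np.cos(angles), np.sin(angles)])  # shape (n_hidden, n_features)

# Assumption: a hand-chosen negative bias, large enough to cancel the
# positive interference between neighbouring feature directions.
b = -0.35


def reconstruct(x):
    """Compress x into n_hidden dimensions, then decode with ReLU(W^T h + b)."""
    h = W @ x                            # 5 features squeezed into 2 dimensions
    return np.maximum(W.T @ h + b, 0.0)  # ReLU filters out small interference


# Sparse case: exactly one feature is active at a time.
for i in range(n_features):
    x = np.zeros(n_features)
    x[i] = 1.0
    print("sparse input", i, "->", np.round(reconstruct(x), 3))

# Dense case: two non-adjacent features are active at once.
x_dense = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
print("dense input   ->", np.round(reconstruct(x_dense), 3))
```

With one feature active at a time, the output is nonzero only at the active feature, so five features survive a two-dimensional bottleneck. When two non-adjacent features fire together, their interference wipes out both and a different feature lights up instead, which is why this kind of packing depends on the features being sparse.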

Authors

Catherine Olsson, Dario Amodei, Dawn Drain, Jared Kaplan, Nelson Elhage, Sam McCandlish, Shauna Kravec, Tom Henighan, Tristan Hume, Zac Hatfield-Dodds, Nicholas Schiefer, Martin Wattenberg, Christopher Olah, Roger Grosse, Robert Lasenby, Carol Chen