Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
Toy Models of Superposition
A toy model shows how polysemanticity, which complicates interpretability, arises when a network stores additional sparse features in superposition; the analysis reveals a phase change, a connection to the geometry of uniform polytopes, and a possible link to adversarial examples.
- Year: 2022
- Venue: arXiv 2022
- Authors: 16
- Hosting: Abstract only (ARXIV-DEFAULT)
Attribution
- Abstract & full text: arxiv.org/abs/2209.10652 (ARXIV-DEFAULT)
- TL;DR: Semantic Scholar