We show that the core components of the Transformer -- attention, residual connections, and normalization -- arise naturally from a single geometric state estimation problem. Modeling the latent state in polar form, with direction constrained to the hypersphere and uncertainty decomposed into radial and tangential components, yields a precision-weighted filtering procedure in which normalization enforces the hyperspherical constraint, attention aggregates directional evidence, and residual connections implement incremental state updates. Under suitable first-order approximations, this estimator reduces to the standard Transformer block with rotary positional encodings, showing that its architecture follows from the underlying estimation problem rather than from independent design choices. Retaining higher-order geometric corrections yields the proposed Polar Transformer, which more faithfully approximates the underlying radial-tangential state estimator.
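To make the three roles named above concrete, the following is a minimal illustrative sketch, not the estimator derived in the paper. It assumes a single nonzero query vector, a softmax over scaled direction inner products as a stand-in for precision weighting, and a tangent-space step followed by renormalization as the residual update. The names `to_polar`, `polar_update`, `kappa`, and `step` are hypothetical and chosen only for this example.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def to_polar(v):
    """Split a nonzero vector into its radius and its unit direction."""
    r = norm(v)
    return r, [x / r for x in v]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def polar_update(query, keys, kappa=4.0, step=0.5):
    """One illustrative update of the query direction on the unit sphere.

    kappa (assumed): concentration that sharpens the attention weights.
    step  (assumed): size of the residual move along the tangent direction.
    """
    # Normalization: keep only the direction of each vector.
    r, u = to_polar(query)
    dirs = [to_polar(k)[1] for k in keys]

    # Attention: weight each key direction by its agreement with the query.
    weights = softmax([kappa * dot(u, d) for d in dirs])
    evidence = [sum(w * d[i] for w, d in zip(weights, dirs))
                for i in range(len(u))]

    # Remove the radial part so the update is tangent to the sphere at u.
    radial = dot(evidence, u)
    tangent = [e - radial * ui for e, ui in zip(evidence, u)]

    # Residual connection: incremental move, then project back to the sphere.
    moved = [ui + step * t for ui, t in zip(u, tangent)]
    _, u_new = to_polar(moved)
    return r, u_new

radius, direction = polar_update([3.0, 0.0], [[0.0, 2.0], [1.0, 1.0]])
```

In this sketch the radius is carried through unchanged and only the direction is updated. This is a simplification for illustration; the abstract describes uncertainty in both the radial and tangential components.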
The Transformer as a Polar State Estimator
- Year
- 2026
- Abstract & full text
- arxiv.org/abs/2605.11007