The Transformer as a Polar State Estimator

We show that the core components of the Transformer -- attention, residual connections, and normalization -- arise naturally from a single geometric state estimation problem.

Year: 2026

Abstract & full text: arxiv.org/abs/2605.11007

Abstract

We show that the core components of the Transformer -- attention, residual connections, and normalization -- arise naturally from a single geometric state estimation problem. Modeling the latent state in polar form, with direction constrained to the hypersphere and uncertainty decomposed into radial and tangential components, yields a precision-weighted filtering procedure in which normalization enforces the hyperspherical constraint, attention aggregates directional evidence, and residual connections implement incremental state updates. Under suitable first-order approximations, this estimator reduces to the standard Transformer block with rotary positional encodings, showing that its architecture follows from the underlying estimation problem rather than from independent design choices. Retaining higher-order geometric corrections yields the proposed Polar Transformer, which more faithfully approximates the underlying radial-tangential state estimator.
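
To make the three roles named in the abstract concrete, here is a minimal sketch of a simplified, single-head, causal Transformer-style update in plain Python. It is an illustration under stated assumptions, not the paper's estimator and not the proposed Polar Transformer: learned query/key/value projections, rotary positional encodings, and the radial-tangential uncertainty decomposition are all omitted. The function names and the scalar `kappa`, which stands in for a precision-like inverse temperature, are hypothetical.

```python
import math


def normalize(x, eps=1e-8):
    """Project a vector onto the unit hypersphere, keeping only its direction."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def attention_update(states, kappa=4.0):
    """One simplified block: normalize, attend causally, add a residual.

    `kappa` is a hypothetical scalar acting as an inverse temperature;
    it is an assumption of this sketch, not a quantity from the paper.
    """
    # Normalization: enforce the hyperspherical constraint on directions.
    dirs = [normalize(x) for x in states]
    new_states = []
    for t, x in enumerate(states):
        # Attention: weight directions at positions 0..t by similarity to t.
        weights = softmax([kappa * dot(dirs[t], dirs[s]) for s in range(t + 1)])
        evidence = [
            sum(w * dirs[s][i] for s, w in enumerate(weights))
            for i in range(len(x))
        ]
        # Residual connection: add the aggregated evidence to the state.
        new_states.append([xi + ei for xi, ei in zip(x, evidence)])
    return new_states
```

Calling `attention_update([[1.0, 0.0], [0.0, 2.0]])` returns one updated vector per position. The first position can only attend to itself, so its update is simply its own unit direction added to its state.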