Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
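The correlation argument in the abstract can be illustrated with a toy numerical check. The sketch below is not the paper's experimental setup: it assumes a scalar state and a deliberately trivial shared map $f(h) = h$ for the looped case, and independent random-sign maps $f_k(h) = s_k h$ for the unshared case. The function names `looped_gain` and `unshared_rms_gain` are hypothetical and exist only for this illustration.

```python
import math
import random


def looped_gain(n_loops: int, eps: float) -> float:
    """Apply h <- h + eps * f(h) n_loops times with one shared toy map f(h) = h."""
    h = 1.0
    for _ in range(n_loops):
        # Same f at every iteration, so all residual updates point the same way.
        h = h + eps * h
    return abs(h)


def unshared_rms_gain(n_layers: int, eps: float, trials: int = 2000, seed: int = 0) -> float:
    """Same recursion, but each step draws its own toy map f_k(h) = s_k * h, s_k = +/-1.

    Returns a Monte Carlo estimate of the root-mean-square output for h_0 = 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        h = 1.0
        for _ in range(n_layers):
            # Independent sign per layer: updates are uncorrelated across depth.
            h = h + eps * rng.choice((-1.0, 1.0)) * h
        total += h * h
    return math.sqrt(total / trials)


if __name__ == "__main__":
    print("N  looped@1/sqrt(N)  looped@1/N  unshared@1/sqrt(N)")
    for n in (16, 64, 256):
        print(
            n,
            looped_gain(n, 1.0 / math.sqrt(n)),
            looped_gain(n, 1.0 / n),
            unshared_rms_gain(n, 1.0 / math.sqrt(n)),
        )
```

In this toy, the looped gain under $\varepsilon = 1/\!\sqrt{N}$ is $(1 + 1/\!\sqrt{N})^N$, which grows without bound as $N$ increases, while under $\varepsilon = 1/N$ it is $(1 + 1/N)^N$ and stays below $e$. With independent signs, the mean-square gain under $\varepsilon = 1/\!\sqrt{N}$ is also $(1 + 1/N)^N$, which is why the $1/\!\sqrt{\cdot}$ rule is adequate for unshared layers but not for a shared block.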
On the Residual Scaling of Looped Transformers: Stability and Transferability
- Year
- 2026
Attribution
- Abstract & full text
- arxiv.org/abs/2606.18524