On the Residual Scaling of Looped Transformers: Stability and Transferability

Year: 2026
arXiv: arxiv.org/abs/2606.18524

Abstract

Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\sqrt{N}$ scaling across loop counts.
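
The scaling argument can be illustrated with a toy calculation. The sketch below is not the paper's experiment: it assumes the extreme case in which weight tying makes every residual update identical, ignores the dependence of $f$ on $h$, and compares that case with fully independent updates. The names `residual_scale` and `accumulated_norm` are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def residual_scale(lam, n_loops, n_layers):
    """Factored scale from the abstract: eps = lambda / (N * sqrt(L))."""
    return lam / (n_loops * math.sqrt(n_layers))


def accumulated_norm(n_loops, eps, shared, dim=256, seed=0):
    """RMS size of eps * (sum of n_loops unit-variance update vectors).

    shared=True reuses one update vector at every step (a crude stand-in
    for weight tying, assumed here for illustration); shared=False draws
    a fresh independent vector at each step.
    """
    rng = random.Random(seed)
    total = [0.0] * dim
    update = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(n_loops):
        if not shared:
            update = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for i in range(dim):
            total[i] += eps * update[i]
    # Divide by dim so that a single unit-variance vector has size about 1.
    return math.sqrt(sum(x * x for x in total) / dim)


for n in (4, 16, 64, 256):
    indep = accumulated_norm(n, 1 / math.sqrt(n), shared=False)
    tied_sqrt = accumulated_norm(n, 1 / math.sqrt(n), shared=True)
    tied_inv = accumulated_norm(n, 1 / n, shared=True)
    print(f"N={n:4d}  independent,1/sqrt(N): {indep:.2f}  "
          f"tied,1/sqrt(N): {tied_sqrt:.2f}  tied,1/N: {tied_inv:.2f}")
```

In this toy setting the independent sum stays near constant size under $1/\sqrt{N}$, the tied sum grows roughly like $\sqrt{N}$ under the same scale, and the tied sum stays constant under $1/N$. Real looped blocks have correlated rather than identical updates, so this only shows the direction of the effect.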