0

Convergence Rate Analysis of the AdamW-style Shampoo: Unifying One-Sided and Two-Sided Preconditioning

This paper studies AdamW-style Shampoo, an effective variant of the classical Shampoo that won the external tuning track of the AlgoPerf neural network training competition. Our analysis unifies one-sided and two-sided preconditioning.

Preview
Year
2026
Hosting
Full text hostedCC-BY-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2601.07326CC-BY-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

This paper studies AdamW-style Shampoo, an effective variant of the classical Shampoo that won the external tuning track of the AlgoPerf neural network training competition. Our analysis unifies one-sided and two-sided preconditioning. When the exponents of the two preconditioners sum to 1/2, we establish the convergence rate \frac{1}{K}\sum_{k=1}^KE\left[||\nabla f(X_k)||_\right]\leq O(\frac{\sqrt{m+n}C}{K^{1/4}}), where K represents the number of iterations, (m,n) denotes the dimensions of the matrix-valued parameters, and C matches the constant appearing in the optimal convergence rate of SGD. Theoretically, the nuclear norm and Frobenius norm satisfy ||\nabla f(X)||F\leq ||\nabla f(X)||\leq \sqrt{\min{m,n}}||\nabla f(X)||F, which suggests that our convergence rate is analogous to the optimal \frac{1}{K}\sum{k=1}^KE\left[||\nabla f(X_k)||F\right]\leq O(\frac{C}{K^{1/4}}) convergence rate of SGD in the ideal case where ||\nabla f(X)||*= Θ(\sqrt{\min{m,n}})||\nabla f(X)||_F and m and n are of comparable magnitude. Then, we extend our analysis to settings where the preconditioning exponents do not sum to 1/2, and establish convergence with an explicit but more involved rate.