0

Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization.

Preview
Year
2026
Hosting
Excerpt onlyCC-BY-NC-SA-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2604.09967CC-BY-NC-SA-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, the orthogonalization quality of Muon hinges on the number of Newton--Schulz (NS) iterations performed, which poses efficiency challenges due to its non-trivial computation and communication cost. We propose Muon^2, an extension of Muon, to improve both quality and efficiency by applying Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, of which the spectrum is substantially improved by Muon^2, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize the practical orthogonalization quality via directional alignment, under which Muon^2 demonstrates dramatic improvement over Muon at each polar step. Across GPT, LLaMA, and Mixture-of-Experts pre-training experiments up to 13B parameters, Muon^2 (and its memory-efficient variant Muon^2-F that preserves most of its benefits) consistently outperforms Muon and its variants while reducing NS iterations by 40%, and saves up to 1/4 training time over Muon when achieving the same loss.