To Use or Not to Use Muon: How Simplicity Bias in Optimizers Matters

While Adam has long been the default optimizer for deep neural networks, Muon has recently seen rapid adoption due to its faster training. Much of the literature focuses on validating the benefits of Muon; our work instead investigates the potential downsides of the mechanism driving this speedup. On the theoretical front, we analyze the learning dynamics of simplified Muon on deep linear networks and linear attention. Our analysis reveals that Muon gains speed by avoiding saddle points, but does so at the expense of the simplicity bias characteristic of gradient descent (GD), under which the complexity of the learned function grows sequentially over the course of training. Experiments demonstrate the consequences of losing this simplicity bias: Muon struggles to uncover common underlying structure across tasks and may be prone to fitting spurious features. More broadly, this paper serves as a reminder that faster optimization is rarely a free lunch; improvements in optimization can come at the cost of changes in the inductive biases that shape generalization.
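
As a rough illustration of the mechanism discussed above, the following sketch contrasts a single GD update with a single orthogonalized-gradient update on a weight matrix. It assumes that "simplified Muon" means replacing the gradient by its polar factor U V^T with no momentum; this is an assumption for illustration and may differ from the exact variant analyzed in the paper (practical Muon implementations typically approximate the orthogonalization iteratively and add momentum). The function names `orthogonalize`, `gd_step`, and `simplified_muon_step` are hypothetical.

```python
import numpy as np

def orthogonalize(grad):
    """Return the polar factor U V^T, i.e. grad with every singular value set to 1."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def gd_step(weight, grad, lr):
    """Plain gradient descent: the update is proportional to the raw gradient."""
    return weight - lr * grad

def simplified_muon_step(weight, grad, lr):
    """Assumed simplified Muon: the update uses the orthogonalized gradient."""
    return weight - lr * orthogonalize(grad)

rng = np.random.default_rng(0)

# Build a synthetic gradient whose singular values are strongly separated.
u, _ = np.linalg.qr(rng.standard_normal((4, 4)))
v, _ = np.linalg.qr(rng.standard_normal((4, 4)))
grad = u @ np.diag([10.0, 1.0, 0.1, 0.01]) @ v.T

weight = np.zeros((4, 4))
gd_update = weight - gd_step(weight, grad, lr=0.1)
muon_update = weight - simplified_muon_step(weight, grad, lr=0.1)

# Singular values of each update, in descending order.
gd_singular_values = np.linalg.svd(gd_update, compute_uv=False)
muon_singular_values = np.linalg.svd(muon_update, compute_uv=False)

print("GD update singular values:  ", gd_singular_values)
print("Muon update singular values:", muon_singular_values)
```

In this toy setting the GD update is dominated by the largest singular direction of the gradient, whereas the orthogonalized update moves equally along all singular directions. This is one way to picture the trade-off described above: directions are no longer learned one after another, so training can move faster, but the sequential ordering behind the simplicity bias is lost.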