0

Adaptive Momentum and Nonlinear Damping for Neural Network Training

Momentum Stochastic Gradient Descent (mSGD) relies on a fixed momentum coefficient shared across all parameters, failing to account for the heterogeneous structure of modern loss landscapes.

Preview
Year
2026
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2602.00334ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Momentum Stochastic Gradient Descent (mSGD) relies on a fixed momentum coefficient shared across all parameters, failing to account for the heterogeneous structure of modern loss landscapes. In this work, we adopt a continuous-time formulation to introduce individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This mechanism automatically adjusts to evolving training dynamics to maintain stability without sacrificing convergence speed. We demonstrate that this adaptive friction is inextricably linked to cubic damping, a suppression mechanism from structural dynamics. We additionally introduce two optimization schemes by augmenting the continuous dynamics of mSGD and Adam with a cubic damping term. Empirically, our methods demonstrate robustness and match or outperform Adam on training ViT, BERT, and GPT2 tasks where mSGD typically struggles. We further provide theoretical results establishing the exponential convergence of the proposed schemes.