0

Decentralized Stochastic Nonconvex Optimization under the $(L_0,L_1)$-Smoothness

This paper focuses on the decentralized stochastic optimization problem $f(\mathbf{x})=\frac{1}{m}\sum_{i=1}^m f_i(\mathbf{x})$ over a connected network of $n$ agents, where each local function has the form of $f_i(\mathbf{x}) = {\mathbb E}\left[F(\mathbf{x};{\boldsymbol…

Preview
Year
2025
Hosting
Full text hostedCC-BY-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2509.08726CC-BY-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

This paper focuses on the decentralized stochastic optimization problem f(x)=\frac{1}{m}\sum_{i=1}^m f_i(x) over a connected network of n agents, where each local function has the form of f_i(x) = {\mathbb E}\left[F(x;{\boldsymbol ξ}_i)\right] which satisfies the (L_0,L_1)-smooth condition but possibly nonconvex and each random variable {\boldsymbol ξ}_i follows distribution {\mathcal D}_i. We propose a novel algorithm called decentralized normalized stochastic gradient descent (DNSGD), which can achieve an ε-stationary point at each local agent. We present a new framework for analyzing decentralized first-order methods in the (L_0,L_1)-smooth setting, based on the Lyapunov function related to the product of the gradient norm and the consensus error. We show that the proposed algorithm attains the upper bounds on the sample complexity of {\mathcal O}(m^{-1}(L_fσ^2Δ_fε^{-4} + σ^2ε^{-2} + L_f^{-2}L_1^3σ^2Δ_fε^{-1} + L_f^{-2}L_1^2σ^2)) per agent and the communication complexity of \tilde{\mathcal O}((L_fε^{-2} + L_1ε^{-1})γ^{-1/2}Δ_f), where L_f=L_0 +L_1ζ, σ^2 is the variance of the stochastic gradient, Δ_f is the initial optimal function value gap, γ is the spectral gap of the network, and ζ is the degree of the gradient dissimilarity. In the special case of L_1=0, the above results (nearly) match the lower bounds of decentralized stochastic nonconvex optimization under the standard smoothness. We also conduct numerical experiments to show the empirical superiority of our method.