Random Reshuffling Dominates Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of SGD differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent (Shuffling SGD). A particularly popular strategy in Shuffling SGD is Random Reshuffling (RR), which has achieved great empirical success across numerous experiments. Despite its strong performance, RR has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for RR, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of RR remain to this day. More precisely, according to the current theory, Shuffling SGD under RR converges only when the stepsize is smaller than a threshold proportional to 1/n, where n is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of Shuffling SGD under RR is strictly worse than that of SGD when the number of epochs is smaller than another threshold proportional to n. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that RR dominates SGD in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.