Geometrically Principled Randomized Optimization for Efficient LLM Training

Low-rank gradient optimization for large language models is currently divided into two categories: structured methods that rigorously identify subspaces, and randomized approaches employed primarily for computational efficiency. In this work, we examine why random projections are effective. We trace this phenomenon to the geometry of the gradient subspaces: the subspace optimization landscape has nearly flat curvature, while a significant portion of the gradient information lies outside the core subspace. Leveraging these insights, and drawing on randomized linear algebra, we theoretically establish that random low-rank projections preserve the geometry, and we introduce GrassWalk and GrassJump, algorithms that navigate the Grassmannian manifold via random walks and jumps. By coupling this randomized exploration with a subspace-aware optimizer and recovering the lost gradient signals, we achieve state-of-the-art results on LLaMA-1B, LLaMA-7B, and Qwen-1.5B pretraining. Our findings reframe randomization not merely as a computational shortcut, but as a geometrically principled approach to high-dimensional optimization.
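
To make the ingredients named above concrete, the following is a minimal NumPy sketch of a random low-rank gradient projection, the residual signal that falls outside the sampled subspace, and two generic ways of moving between subspaces on the Grassmannian. It is an illustration under stated assumptions, not the GrassWalk or GrassJump algorithms themselves: the function names (`random_subspace`, `random_walk_step`, `project`), the QR-based retraction, the Gaussian sampling, and the toy dimensions are all hypothetical choices made for this example.

```python
import numpy as np

def random_subspace(m, r, rng):
    """Sample an orthonormal basis (m x r) of a random r-dimensional subspace.

    Assumption: QR of a Gaussian matrix is used as the sampler.
    """
    q, _ = np.linalg.qr(rng.standard_normal((m, r)))
    return q

def random_walk_step(basis, step_size, rng):
    """Take a small random step from the current subspace (hypothetical helper).

    The Gaussian noise is projected onto the orthogonal complement of the
    current subspace, added to the basis, and re-orthonormalized with QR.
    """
    m, r = basis.shape
    noise = rng.standard_normal((m, r))
    tangent = noise - basis @ (basis.T @ noise)
    q, _ = np.linalg.qr(basis + step_size * tangent)
    return q

def project(grad, basis):
    """Split a gradient into its low-rank coordinates and the residual."""
    low_rank = basis.T @ grad            # r x n coordinates inside the subspace
    residual = grad - basis @ low_rank   # m x n component outside the subspace
    return low_rank, residual

rng = np.random.default_rng(0)
m, n, r = 64, 32, 8                      # toy sizes, chosen for illustration
grad = rng.standard_normal((m, n))       # stand-in for a layer's gradient matrix

basis = random_subspace(m, r, rng)
low_rank, residual = project(grad, basis)

walked = random_walk_step(basis, 0.1, rng)   # nearby subspace ("walk")
jumped = random_subspace(m, r, rng)          # independent fresh subspace ("jump")
```

In this sketch the optimizer state would live in the `r x n` coordinates `low_rank`, while `residual` holds the part of the gradient that a purely low-rank update discards. How that residual is reintroduced, and how often the subspace is walked or resampled, are design decisions this example leaves open.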