Riemannian Gradient Descent for Low-Rank Architectures

We explore Riemannian optimization techniques for rank-factored matrix parameters, targeting contemporary deep learning applications. We examine ten points in the algorithm design space: two geometries for rank-r matrices, three geometries for rank-r partial isometries, and block-matrix variants of these five, where factors are shared across block-rows and block-columns. We apply our methods to the multihead attention parameters in small language models. After tuning learning rates, our methods do not conclusively outperform an AdamW baseline. Our implementations are available online.