u-$\mu$P: The Unit-Scaled Maximal Update Parametrization

The u-μP scheme combines Maximal Update Parametrization with Unit Scaling to achieve near-optimal default hyperparameters, enabling efficient model training and low-precision operations.

Year
2024
Venue
arXiv 2024
Authors
10
Hosting
Abstract only

Attribution

Abstract & full text: arxiv.org/abs/2407.17465v3
TL;DR: Semantic Scholar

Abstract

The Maximal Update Parametrization ($\mu$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$\mu$P, which improves upon $\mu$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low-precision. The two techniques have a natural affinity: $\mu$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$\mu$P models reaching a loss that is equal to or lower than comparable $\mu$P models and working out-of-the-box in FP8.
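
The claim that activations can start at a scale of one regardless of model size can be illustrated with a small sketch. It is not the paper's implementation: the helper names (`unit_scaled_linear`, `output_std`) are hypothetical, and the only rule it assumes is the standard one of drawing weights and inputs from a unit-variance Gaussian and multiplying the matmul output by 1/sqrt(fan_in). It covers the forward pass only; the abstract does not spell out how gradients or learning rates are scaled, so those are left out.

```python
import math
import random
import statistics


def unit_scaled_linear(x, w):
    """Matmul with a fixed 1/sqrt(fan_in) output scale (hypothetical helper)."""
    fan_in = len(w)
    scale = 1.0 / math.sqrt(fan_in)
    cols = list(zip(*w))  # transpose so each column of w is a tuple
    return [
        [scale * sum(a * b for a, b in zip(row, col)) for col in cols]
        for row in x
    ]


def output_std(fan_in, fan_out=16, batch=64, seed=0):
    """Empirical std of the layer output for unit-variance inputs and weights."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(batch)]
    w = [[rng.gauss(0.0, 1.0) for _ in range(fan_out)] for _ in range(fan_in)]
    y = unit_scaled_linear(x, w)
    return statistics.pstdev([v for row in y for v in row])


# Assumed widths, chosen only for illustration.
stds = {width: output_std(width) for width in (64, 256, 1024)}
for width, std in stds.items():
    print(f"fan_in={width:5d}  output std={std:.3f}")
```

Under these assumptions the printed standard deviation should stay close to one at every width, whereas dropping the `scale` factor would let it grow like sqrt(fan_in).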
