Scaling depth capacity via zero/one-layer model expansion

Model depth is a double-edged sword in deep learning: deeper models tend to achieve higher accuracy but incur higher computational cost. To train models efficiently at scale, progressive training (also known as model expansion) scales up model capacity during training and significantly reduces computation with little performance degradation. In this work, we study the depth expansion of large-scale models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, the learning rate schedule, and the timing of model expansion. Specifically, we propose zero/one-layer progressive training to achieve an optimal tradeoff between computation and loss, and we provide comprehensive ablations of our expansion strategy. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ of compute, or equivalently achieve an $\approx 5\times$ acceleration, while attaining a loss comparable to that of a fully trained 60-layer model with 7B parameters, thus demonstrating a mixing behavior in terms of loss. Furthermore, scaling laws on LLAMA3 and DeepSeekV3 models show a $3\sim 5\times$ improvement in compute efficiency, with an increasing advantage at larger scales.
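
The equivalence between the compute saving and the acceleration factor quoted above is simple arithmetic: if a fraction s of the baseline compute is saved, the remaining cost is (1 - s) of the baseline, so the implied speedup is 1 / (1 - s). A minimal sketch of this conversion follows; the function name `speedup_from_savings` is a hypothetical helper introduced only for illustration and is not part of the method itself.

```python
def speedup_from_savings(saved_fraction: float) -> float:
    """Acceleration factor implied by saving a fraction of baseline compute.

    Hypothetical helper for illustration: a run that costs (1 - s) of the
    baseline is 1 / (1 - s) times cheaper than the baseline.
    """
    if not 0.0 <= saved_fraction < 1.0:
        raise ValueError("saved_fraction must be in [0, 1)")
    return 1.0 / (1.0 - saved_fraction)


# An 80% compute saving corresponds to roughly a 5x acceleration.
print(round(speedup_from_savings(0.80), 2))
```

The same relation can be read in reverse: a 3x to 5x gain in compute efficiency corresponds to using roughly one third to one fifth of the baseline compute for a comparable loss.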