Scaling Exponents Across Parameterizations and Optimizers

Ranked #10 of 2635 papers in ICML 2024

Authors

Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander Alemi, Roman Novak, Peter Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Kaelbling, Jaehoon Lee, Jeffrey Pennington

Topics

scaling exponents, parameterization methods, optimizer design, learning rate scaling, hyperparameter transfer, maximal update parameterization, adam optimizer variants, numerical stability optimization

Abstract

Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 27B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.
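
The abstract does not spell out the Adam-atan2 update rule, so the following is only a minimal scalar sketch under a stated assumption: that the core change is to replace Adam's usual ratio m_hat / (sqrt(v_hat) + epsilon) with atan2(m_hat, sqrt(v_hat)). The function name `adam_atan2_step`, its default hyperparameters, and the omission of any extra scaling constants are illustrative choices, not details taken from the paper.

```python
import math

def adam_atan2_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999):
    """One scalar Adam-style step with atan2 in place of m / (sqrt(v) + eps).

    Assumption (not stated in the abstract): the update direction is
    atan2(m_hat, sqrt(v_hat)). `step` is the 1-based step count.
    """
    # Standard Adam moment estimates.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Standard Adam bias correction.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # atan2(y, x) is unchanged when y and x are multiplied by the same
    # positive constant, and atan2(0, 0) is 0, so no epsilon is needed.
    update = math.atan2(m_hat, math.sqrt(v_hat))
    return param - lr * update, m, v

# Usage: rescaling the gradient by a tiny positive factor leaves the step
# essentially unchanged, and a zero gradient produces no update.
p_large, _, _ = adam_atan2_step(0.0, 1.0, 0.0, 0.0, step=1)
p_small, _, _ = adam_atan2_step(0.0, 1e-20, 0.0, 0.0, step=1)
p_zero, _, _ = adam_atan2_step(0.5, 0.0, 0.0, 0.0, step=1)
```

In this sketch the scale invariance comes directly from the atan2 identity rather than from tuning epsilon to the gradient scale, which is the property the abstract highlights; a tensor version would apply the same elementwise operation.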
