Understanding Adam Requires Better Rotation Dependent Assumptions

6 citations · #458 of 5858 papers in NeurIPS 2025 · 8 authors

Abstract

Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam's basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.
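The abstract's central experiment, running Adam on the same objective expressed in a randomly rotated basis, can be illustrated on a toy problem. The sketch below is not the paper's code; the quadratic objective, dimensionality, and hyperparameters are illustrative assumptions. It optimizes an ill-conditioned quadratic with a plain NumPy implementation of Adam, once in the original axis-aligned basis and once after a random orthogonal rotation of the parameters, so the two final losses can be compared.

```python
# Minimal sketch (assumed setup, not the paper's experiments): compare Adam on
# f(x) = 0.5 * x^T D x in the original basis vs. a randomly rotated basis,
# where parameters y satisfy x = Q y for a random orthogonal Q.
import numpy as np

def adam(grad_fn, x0, steps=500, lr=1e-1, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam with bias correction on a deterministic gradient oracle."""
    x, m, v = x0.copy(), np.zeros_like(x0), np.zeros_like(x0)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

rng = np.random.default_rng(0)
d = 50
D = np.diag(np.logspace(0, 3, d))                  # ill-conditioned diagonal Hessian
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal rotation

x0 = rng.standard_normal(d)
loss = lambda x: 0.5 * x @ D @ x

# Axis-aligned basis: gradient of f is D x.
x_star = adam(lambda x: D @ x, x0)
# Rotated basis: same objective in coordinates y = Q^T x, gradient Q^T D Q y.
y_star = adam(lambda y: Q.T @ (D @ (Q @ y)), Q.T @ x0)

print("final loss, original basis:", loss(x_star))
print("final loss, rotated basis :", loss(Q @ y_star))
```

In the axis-aligned run, Adam's per-coordinate step sizes line up with the diagonal curvature; after the rotation the same objective mixes coordinates, so the diagonal preconditioning no longer matches the curvature. This is the kind of basis sensitivity the paper studies at scale on transformers.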

Citation History

Jan 25, 2026: 0
Jan 27, 2026: 0
Jan 31, 2026: 6