Research · 2026-05-12
Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle
Source: arXiv cs.AI
arXiv:2601.21739v2
Abstract: Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters...
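For reference (standard background on Adam, not quoted from this abstract), the moment updates controlled by the two momentum parameters are

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad
\theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},
$$

with bias corrections $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$. The paper's observation concerns setting $\beta_1 = \beta_2$, as opposed to the common default $(\beta_1, \beta_2) = (0.9, 0.999)$.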