Research · 2026-05-12
Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle
Source: arXiv cs.AI
arXiv:2601.21739v2
Abstract: Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters...
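For reference (standard background on Adam, not quoted from this abstract), the moment updates controlled by the two momentum parameters are

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad
\theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},
$$

with bias corrections $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$. The paper's observation concerns setting $\beta_1 = \beta_2$, as opposed to the common default $(\beta_1, \beta_2) = (0.9, 0.999)$.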