BeClaude
Research 2026-05-01

Learning Rate Transfer in Normalized Transformers

Source: Arxiv CS.AI

arXiv:2604.27077v1 Announce Type: cross Abstract: The Normalized Transformer, or nGPT (arXiv:2410.01131), achieves impressive training speedups and requires neither weight decay nor learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that...
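For context, the defining operation of nGPT is keeping weight vectors on the unit hypersphere, which is why weight decay becomes unnecessary. A minimal sketch of that renormalization step, assuming a NumPy weight matrix whose rows are the normalized vectors (the function name `renormalize_rows` is illustrative, not from the paper):

```python
import numpy as np

def renormalize_rows(w: np.ndarray) -> np.ndarray:
    """Project each row of a weight matrix back onto the unit hypersphere,
    as done after every optimizer step in nGPT-style training."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / norms

# toy example: random weights, then renormalize
rng = np.random.default_rng(0)
w = renormalize_rows(rng.normal(size=(4, 8)))
# every row of w now has unit L2 norm
```

Because every weight vector is rescaled to unit norm after each update, the norms cannot drift, which removes the usual role of weight decay.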

arxivpapers