BeClaude
Research · 2026-06-30

On the Nonlinearity of Learning Rate Scaling for LLM Training

Originally published by arXiv cs.AI

arXiv:2606.29158v1 Announce Type: cross Abstract: Learning-rate transfer can reduce the cost of training large language models: instead of sweeping learning rates at target scale, practitioners extrapolate from smaller runs. Existing approaches often assume that the optimal learning rate follows a...

The assumption that optimal hyperparameters, particularly the learning rate, scale predictably with model size has been a cornerstone of efficient large language model (LLM) training. A new preprint, "On the Nonlinearity of Learning Rate Scaling for LLM Training," directly challenges this orthodoxy. The authors provide empirical and theoretical evidence that the relationship between model scale and the optimal learning rate is not a simple power law: it is nonlinear and can even be non-monotonic.

What Happened

The paper systematically investigates the "learning rate transfer" hypothesis, which posits that the optimal learning rate found in small-scale training runs can be reliably extrapolated to much larger models. By conducting extensive sweeps across models ranging from 70 million to over 1 billion parameters, the researchers discovered that the optimal learning rate does not follow a smooth, predictable curve. Instead, it exhibits significant nonlinearities: as model size increases, the optimal learning rate can plateau, dip, or even rise again depending on the specific architecture, data distribution, and training horizon. The study identifies that factors such as batch size, the gradient noise scale, and the model's width-to-depth ratio create complex interactions that break the simple scaling laws previously assumed.
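
To make the transfer hypothesis concrete, here is a minimal sketch of the kind of extrapolation it implies: fit a power law to the best learning rates found at small scales, then check how well that fit predicts a held-out larger run. The function names and the sweep numbers are hypothetical and purely illustrative; they are not taken from the paper, and the paper's own fitting procedure may differ.

```python
import math

def fit_power_law(sizes, lrs):
    """Least-squares fit of log(lr) = log(a) + b * log(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(lr) for lr in lrs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def extrapolate(a, b, size):
    """Learning rate predicted by the fitted power law at a given model size."""
    return a * size ** b

def holdout_log_error(sizes, lrs):
    """Fit on all but the largest run; return log(measured / predicted) there.

    A value near 0 means the power law transferred; a large magnitude
    means the largest run deviates from the trend of the smaller ones.
    """
    a, b = fit_power_law(sizes[:-1], lrs[:-1])
    return math.log(lrs[-1] / extrapolate(a, b, sizes[-1]))

# Hypothetical sweep results (illustrative numbers only, NOT from the paper).
sizes = [7e7, 1.6e8, 4.1e8, 1.0e9]
best_lrs = [3.0e-3, 2.0e-3, 1.6e-3, 1.7e-3]

a, b = fit_power_law(sizes[:-1], best_lrs[:-1])
print("fitted exponent:", b)
print("predicted LR at largest size:", extrapolate(a, b, sizes[-1]))
print("held-out log error:", holdout_log_error(sizes, best_lrs))
```

If the optimum really followed a single power law, the held-out log error would stay close to zero as larger runs are added. A plateau or reversal of the kind the paper describes would show up as an error that grows with scale.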

Why It Matters

This finding has immediate and practical consequences for the AI industry. The prevailing wisdom, that you can sweep learning rates on a small model and then extrapolate the best one to a 10x or 100x larger model, is now shown to be unreliable. If the optimal learning rate does not scale predictably with model size, practitioners who follow this rule blindly risk training at suboptimal efficiency: wasted compute, slower convergence, and potentially worse final model quality. The paper effectively undermines a key cost-saving heuristic that many labs have relied upon to avoid expensive hyperparameter sweeps at scale.

For frontier labs, this is a wake-up call. The assumption of predictable transfer has allowed teams to amortize the cost of hyperparameter tuning across multiple model sizes. If that assumption is broken, the cost of finding the optimal learning rate for a 100-billion-parameter model increases dramatically, potentially requiring full-scale sweeps that can cost millions of dollars. The research suggests that the "free lunch" of cheap hyperparameter transfer is not universally available.

Implications for AI Practitioners

For engineers and researchers, the immediate takeaway is caution. Do not assume that a learning rate that worked for a 1B model will be optimal for a 7B or 70B model. The paper recommends several mitigations:

  • Validate transfer at intermediate scales: Instead of jumping directly from a tiny proxy to the target, validate the learning rate at one or two intermediate model sizes.
  • Monitor gradient statistics: The nonlinearity is linked to the gradient noise scale. Practitioners should track this metric during small-scale runs to better predict when the learning rate will deviate from expectations.
  • Reconsider sweep budgets: Budget for at least a partial learning rate sweep at the target scale, rather than relying entirely on transfer.
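
As an illustration of the gradient-statistics point, the sketch below estimates a gradient noise scale from squared gradient norms measured at two batch sizes. This article does not say how the paper measures the noise scale, so the code assumes the "simple noise scale" estimator from the large-batch training literature (McCandlish et al., 2018). The function name and the example numbers are hypothetical.

```python
def simple_noise_scale(sq_norm_small, sq_norm_big, b_small, b_big):
    """Estimate B_simple = tr(Sigma) / |G|^2 from two batch sizes.

    Assumes E[|G_B|^2] = |G|^2 + tr(Sigma) / B, where G is the true
    gradient and Sigma the per-example gradient covariance. The inputs
    are squared gradient norms measured at batch sizes b_small < b_big,
    ideally averaged over many steps before being passed in.
    """
    if b_big <= b_small:
        raise ValueError("b_big must be larger than b_small")
    # Solve the two-equation system for |G|^2 and tr(Sigma).
    true_sq_norm = (b_big * sq_norm_big - b_small * sq_norm_small) / (b_big - b_small)
    trace_sigma = (sq_norm_small - sq_norm_big) / (1.0 / b_small - 1.0 / b_big)
    if true_sq_norm <= 0:
        raise ValueError("estimate of |G|^2 is non-positive; average over more steps")
    return trace_sigma / true_sq_norm

# Hypothetical measurements (illustrative only): squared gradient norms
# averaged over a window of steps at batch sizes 32 and 256.
noise_scale = simple_noise_scale(10.25, 4.78125, 32, 256)
print("estimated noise scale:", noise_scale)
```

Single-step estimates of this ratio are very noisy, which is why the inputs should be running averages. Tracking the estimate across several small model sizes is one way to look for the kind of drift the second bullet warns about.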

This research does not invalidate scaling laws, but it refines them. It shows that scaling is a richer, more complex phenomenon than a simple power-law curve. For those building the next generation of LLMs, this paper is a crucial reminder that the cheapest path is not always the safest.

Key Takeaways

  • The optimal learning rate for LLMs does not follow a simple, monotonic scaling law; it exhibits significant nonlinearities that break the assumption of easy transfer from small to large models.
  • Relying on learning rate transfer without validation at intermediate scales risks significant training inefficiency and wasted compute.
  • Practitioners should monitor gradient noise scale and validate at multiple model sizes to avoid suboptimal hyperparameter choices.
  • The finding raises the expected cost and complexity of hyperparameter tuning for large models, challenging a key cost-saving heuristic in the industry.