What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy
arXiv:2606.18465v1. Abstract: Grokking, the delayed jump from memorization to generalization, is usually tied to the weight norm: a smaller norm generalizes sooner. We ask what the norm actually controls. Holding the weight norm fixed by clamping and varying only an output...
What Happened
This new arXiv paper tackles a persistent puzzle in mechanistic interpretability: grokking, where a neural network abruptly shifts from memorizing its training data to generalizing correctly after extended training. The standard narrative ties the effect to weight norm, the overall magnitude of the learned parameters. Smaller norms are associated with earlier generalization, and the prevailing view holds that weight decay shrinks the norm until the network settles into a simpler, generalizing solution.
The researchers challenge this assumption directly. By clamping the weight norm at a fixed value and varying only the scale of the output logits (the raw, unnormalized scores before softmax), they isolate what the norm actually controls. Their key finding: weight norm does not directly cause generalization. Instead, it sets the scale of the logits, which in turn determines how strongly the cross-entropy loss responds to each prediction. When the logits on already-fit training examples are large, the loss saturates and the gradient signal all but vanishes. When logits are small, the loss stays sensitive and keeps supplying gradient that can move the model toward a generalizing solution.
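The saturation effect can be checked numerically. The following is a minimal sketch in plain Python, not code from the paper: the base logits, the target class, and the scale values are illustrative assumptions. It relies on the standard result that the gradient of softmax cross-entropy with respect to the logits equals the softmax probabilities minus the one-hot target.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_loss_and_grad(logits, target):
    """Cross-entropy loss and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # d(loss)/d(logit_k) = softmax_k - 1[k == target]
    grad = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    return loss, grad


def grad_norm_at_scale(base_logits, target, scale):
    """L2 norm of the logit gradient after multiplying all logits by `scale`."""
    scaled = [scale * z for z in base_logits]
    _, grad = ce_loss_and_grad(scaled, target)
    return math.sqrt(sum(g * g for g in grad))


# Illustrative values (assumptions): one example already classified correctly.
base_logits = [2.0, 0.5, -1.0]
target = 0

for scale in (0.5, 1.0, 4.0, 16.0):
    norm = grad_norm_at_scale(base_logits, target, scale)
    print(f"logit scale {scale:5.1f} -> gradient norm {norm:.3e}")
```

For this correctly classified example, the printed gradient norm shrinks as the logit scale grows, which is the saturation the paragraph above describes.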
In essence, weight norm is a proxy for logit scale, and logit scale is what actually controls the grokking transition. The norm itself is just the lever; the logit scale is the mechanism.
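The lever-versus-mechanism distinction can be illustrated with a toy setup in which the weight norm is held fixed while the logits are rescaled separately. This is a hedged sketch, not the authors' procedure: the projection step `clamp_to_norm`, the multiplier `output_scale`, and all numeric values are assumptions introduced here for illustration.

```python
import math


def l2_norm(weights):
    """L2 norm of a flat list of weights."""
    return math.sqrt(sum(w * w for w in weights))


def clamp_to_norm(weights, target_norm):
    """Rescale weights so their L2 norm equals target_norm (hypothetical helper)."""
    norm = l2_norm(weights)
    if norm == 0.0:
        return list(weights)  # nothing to rescale
    factor = target_norm / norm
    return [w * factor for w in weights]


def scaled_logits(weight_rows, features, output_scale):
    """Logits of a bias-free linear layer, multiplied by an explicit output scale."""
    return [
        output_scale * sum(w * x for w, x in zip(row, features))
        for row in weight_rows
    ]


# Assumed toy values: a 2-class linear readout on a 2-dimensional feature vector.
rows = [[3.0, 4.0], [-1.0, 2.0]]
features = [1.0, 0.5]

# Clamp the flattened weights to a fixed norm, as one might after each optimizer step.
flat = clamp_to_norm([w for row in rows for w in row], target_norm=1.0)
clamped_rows = [flat[0:2], flat[2:4]]

for output_scale in (0.25, 1.0, 4.0):
    logits = scaled_logits(clamped_rows, features, output_scale)
    print(f"norm {l2_norm(flat):.3f}, output scale {output_scale}: logits {logits}")
```

In this toy, the weight norm stays at the clamped value on every line while the logit magnitudes follow `output_scale`, so the two quantities can be varied independently.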
Why It Matters
This distinction is significant for several reasons. First, it refines our understanding of grokking from a vague "norm regularization" story to a precise, testable hypothesis about how logit scale governs the cross-entropy gradient signal. Researchers can now ask: what training dynamics cause logit scale to shrink or grow, and how does that interact with dataset size, architecture, and optimization?
Second, it has direct implications for training practices. If logit scale, not weight norm per se, is the causal variable, then interventions like logit normalization, temperature scaling, or adaptive loss functions might be more effective than simple weight decay for inducing generalization. Practitioners chasing grokking in small models or synthetic tasks can now target the right knob.
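As one example of such an intervention, temperature scaling divides the logits by a constant before the loss is computed. The sketch below is an illustration under assumed values, not a method taken from the paper; the function name `temperature_scaled_ce` and the chosen logits and temperatures are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_scaled_ce(logits, target, temperature):
    """Cross-entropy on logits / temperature.

    Returns the loss and its gradient with respect to the ORIGINAL logits,
    which picks up a 1 / temperature factor from the chain rule.
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    probs = softmax([z / temperature for z in logits])
    loss = -math.log(probs[target])
    grad = [
        (p - (1.0 if k == target else 0.0)) / temperature
        for k, p in enumerate(probs)
    ]
    return loss, grad


# Assumed example: a confident, correct prediction with large logits.
logits = [8.0, 0.0, -2.0]
target = 0

for temperature in (1.0, 2.0, 4.0):
    loss, grad = temperature_scaled_ce(logits, target, temperature)
    print(f"T={temperature}: loss={loss:.4f}, grad on target logit={grad[0]:.4e}")
```

For this saturated example, raising the temperature shrinks the effective logits, so both the loss and the gradient on the target logit are larger than at a temperature of 1. Whether that helps generalization in a given training run is an empirical question.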
Third, this work strengthens the case for mechanistic interpretability as a rigorous science. Rather than accepting surface-level correlations (smaller norm → better generalization), the authors perform controlled experiments to isolate causation. This methodological standard is exactly what the field needs to move beyond storytelling.
Implications for AI Practitioners
For those training small transformers or studying emergence, this paper suggests a practical shift: monitor logit scale over the course of training, not just weight norm curves. If your model is stuck in memorization, directly suppressing logit magnitudes (e.g., via logit clipping or a smaller initialization scale) may trigger generalization sooner than waiting for weight decay to act.
For safety and alignment researchers studying grokking in larger models, the finding suggests that grokking may be more controllable than previously thought. Rather than relying on slow norm decay, one could design training loops that actively regulate logit scale, potentially accelerating the transition to generalization in safety-critical tasks.
Finally, the paper underscores that cross-entropy loss creates a hidden coupling between parameter magnitude and gradient signal. Seemingly benign choices, such as output layer initialization scale or learning rate, can change whether a model memorizes or generalizes through their effect on logit scale.
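The monitoring and clipping ideas above can be sketched with a few framework-free helpers. The names `logit_rms`, `weight_l2_norm`, and `clip_logits`, and the choice of root-mean-square as the logit-scale statistic, are assumptions made for illustration; a real training loop would compute the same quantities on its own tensors.

```python
import math


def logit_rms(batch_logits):
    """Root-mean-square logit over a batch (list of per-example logit lists)."""
    values = [z for example in batch_logits for z in example]
    return math.sqrt(sum(z * z for z in values) / len(values))


def weight_l2_norm(param_groups):
    """Global L2 norm over several flat parameter lists."""
    return math.sqrt(sum(w * w for group in param_groups for w in group))


def clip_logits(logits, max_abs):
    """Clamp each logit into [-max_abs, max_abs]."""
    return [max(-max_abs, min(max_abs, z)) for z in logits]


# Assumed snapshot of one training step: two examples, two parameter groups.
batch_logits = [[6.0, -3.0, 0.5], [1.0, 7.5, -2.0]]
params = [[0.3, -0.4], [1.2, 0.0, -0.5]]

# Log both curves side by side so they can be compared over training.
print(f"weight norm = {weight_l2_norm(params):.3f}")
print(f"logit RMS   = {logit_rms(batch_logits):.3f}")

# Optional intervention: cap logit magnitudes before the loss is computed.
clipped = [clip_logits(example, max_abs=4.0) for example in batch_logits]
print(f"logit RMS after clipping = {logit_rms(clipped):.3f}")
```

One caveat on the design: hard clipping passes no gradient to a logit that sits beyond the cap, so a soft alternative such as dividing by a temperature may be preferable in some setups.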
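One way to see the coupling between parameter magnitude and logit scale is a bias-free two-layer ReLU network, where multiplying every weight by a positive factor multiplies the logits by that factor squared. The sketch below uses made-up weights and is only an illustration of this homogeneity property; it does not reproduce any architecture from the paper.

```python
import math


def relu(values):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def matvec(rows, vec):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in rows]


def scale_matrix(rows, alpha):
    """Multiply every entry of the matrix by alpha."""
    return [[alpha * w for w in row] for row in rows]


def two_layer_logits(w1, w2, x):
    """Logits of a bias-free network: W2 @ relu(W1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))


# Assumed toy weights and input.
w1 = [[1.0, -2.0], [0.5, 1.0]]
w2 = [[1.0, 1.0], [-1.0, 2.0]]
x = [1.0, 2.0]

base = two_layer_logits(w1, w2, x)
for alpha in (0.5, 1.0, 3.0):
    scaled = two_layer_logits(scale_matrix(w1, alpha), scale_matrix(w2, alpha), x)
    ratios = [s / b for s, b in zip(scaled, base) if b != 0.0]
    print(f"weights x{alpha}: logits {scaled}, ratio to base {ratios}")
```

Because the logit scale grows faster than the weight scale in this toy, a modest change in initialization or norm can translate into a much larger change in how saturated the cross-entropy loss is.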
Key Takeaways
- Weight norm is not the direct cause of grokking; it controls logit scale, which in turn determines how cross-entropy loss drives generalization.
- Clamping weight norms while varying logit scale reveals that small logits (not small norms) are the true prerequisite for the generalization transition.
- Practitioners should monitor and potentially regulate logit scale directly, rather than relying solely on weight decay to induce grokking.
- This work sets a higher standard for causal experiments in mechanistic interpretability, moving beyond correlation to controlled intervention.