How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks
arXiv:2606.28242v1. Abstract: Understanding how performance scales jointly with model size and data is a central problem in modern machine learning. Existing theoretical works on scaling laws typically describe generalization as a function of data or compute, often in...
A New Lens on Scaling Laws: The Role of Width in Quadratic Neural Networks
A recent arXiv preprint (2606.28242) tackles a central question in AI research: how does generalization performance scale jointly with model size and data? The paper focuses on a specific class of models, quadratic neural networks, but its findings offer a theoretical perspective on the interplay between network width and dataset size that goes beyond the common practice of treating scaling as a simple function of total parameters or compute.
What the Research Reveals
The authors derive analytical scaling laws for quadratic neural networks, a simplified but instructive architecture in which the activation function is a quadratic polynomial. Their key contribution is showing that the generalization error scales differently depending on the width of the network relative to the number of training samples. Specifically, they identify two distinct regimes: a "data-limited" regime where performance improves primarily by adding more data, and a "width-limited" regime where increasing model capacity yields diminishing returns unless data also grows. This joint dependence is more nuanced than the widely cited empirical scaling laws of Kaplan et al. and Hoffmann et al., which are typically expressed with separate power-law terms for model size and data.
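To make the architecture concrete, here is a minimal sketch of a width-m network with a squared activation, written in plain Python. The article does not give the paper's exact parameterization, so the pure square activation (rather than a general quadratic polynomial), the 1/m output normalization, the Gaussian initialization, and the function names `quadratic_network` and `random_network` are all assumptions made for illustration.

```python
import random


def quadratic_network(x, weights, readout):
    """Evaluate a one-hidden-layer network with activation sigma(z) = z**2.

    x:       input vector of length d
    weights: m hidden weight vectors, each of length d (m is the width)
    readout: m output coefficients
    """
    width = len(weights)
    total = 0.0
    for w, a in zip(weights, readout):
        pre_activation = sum(wi * xi for wi, xi in zip(w, x))
        total += a * pre_activation ** 2  # quadratic activation
    # Assumption: average over hidden units (1/m normalization).
    return total / width


def random_network(d, m, seed=0):
    """Hypothetical helper: Gaussian hidden weights, unit readout."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) / d ** 0.5 for _ in range(d)]
               for _ in range(m)]
    readout = [1.0] * m
    return weights, readout


if __name__ == "__main__":
    weights, readout = random_network(d=5, m=8)
    print(quadratic_network([0.5, -1.0, 2.0, 0.0, 1.5], weights, readout))
```

Because every hidden unit squares its input, the output under these assumptions is an even, degree-2 function of the input: flipping the sign of x leaves it unchanged, and doubling x multiplies it by four.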
Crucially, the paper provides closed-form expressions for the test error as a function of both width and dataset size, revealing that the optimal allocation of resources (parameters vs. data) is not fixed but depends on the target error level. For example, when the desired error is very low, the model width must grow faster than the data to maintain optimal scaling.
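The dependence of the optimal allocation on the target error can be illustrated with a toy model. The sketch below does not use the paper's closed-form expressions, which are not reproduced in this article. It assumes instead a generic additive power law, error = A * width^(-alpha) + B * samples^(-beta), with made-up constants, and uses width * samples as a hypothetical cost proxy. Under those assumptions the cheapest allocation that reaches a given error has a simple closed form.

```python
def toy_error(width, samples, A=1.0, B=1.0, alpha=0.5, beta=1.0):
    """Assumed additive power law; NOT the paper's expression."""
    return A * width ** -alpha + B * samples ** -beta


def cheapest_allocation(target_error, A=1.0, B=1.0, alpha=0.5, beta=1.0):
    """Minimize width * samples subject to toy_error == target_error.

    Setting the derivative of the Lagrangian to zero splits the error
    budget as beta/(alpha+beta) for the width term and alpha/(alpha+beta)
    for the data term.
    """
    width_share = target_error * beta / (alpha + beta)
    data_share = target_error * alpha / (alpha + beta)
    width = (A / width_share) ** (1.0 / alpha)
    samples = (B / data_share) ** (1.0 / beta)
    return width, samples


if __name__ == "__main__":
    for target in (0.1, 0.01, 0.001):
        width, samples = cheapest_allocation(target)
        print(f"target={target}: width={width:.0f}, samples={samples:.0f}, "
              f"width/samples={width / samples:.1f}")
```

With these illustrative exponents (alpha smaller than beta), the width-to-samples ratio of the cheapest allocation grows as the target error shrinks, which mirrors the qualitative behavior described above. If alpha equaled beta, the ratio would stay constant, so the effect depends entirely on the assumed exponents.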
Why This Matters for the Field
This work is significant because it moves scaling law theory closer to the realities of modern deep learning. Most existing theoretical analyses either assume infinite data or focus on linear models, leaving a gap between theory and practice. By studying a nonlinear but tractable architecture, the authors demonstrate that width, and not just total parameter count, is a critical control knob. This has direct implications for understanding why overparameterized models generalize well: the paper shows that wide networks can "memorize" less and generalize better when data is abundant, but they also require careful data scaling to avoid overfitting.
For AI practitioners, the message is clear: blindly scaling up model width without proportionally increasing data can lead to suboptimal performance. The findings suggest that when designing experiments or production systems, one should jointly optimize the ratio of width to dataset size, rather than treating them as independent hyperparameters.
Implications for AI Practitioners
- Resource allocation: When compute is limited, it may be more efficient to increase data before width, especially if the target accuracy is moderate.
- Architecture design: The results reinforce the value of wide, shallow networks in data-rich regimes, challenging the assumption that depth is always superior.
- Benchmarking: Future scaling law studies should report both model width and dataset size separately, not just total parameters or FLOPs.
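One way to act on the resource-allocation point is to ask which resource currently limits the error. The sketch below is a hedged illustration, not a procedure from the paper: it assumes a hypothetical additive power law with made-up constants (in practice these would have to be fitted to runs that record width and dataset size separately) and compares the error reduction from doubling width against the reduction from doubling data.

```python
def toy_error(width, samples, A=1.0, B=1.0, alpha=0.5, beta=1.0):
    """Assumed additive power law; NOT the paper's expression."""
    return A * width ** -alpha + B * samples ** -beta


def doubling_gains(width, samples, **law):
    """Error reduction from doubling width vs. doubling samples."""
    base = toy_error(width, samples, **law)
    gain_width = base - toy_error(2 * width, samples, **law)
    gain_data = base - toy_error(width, 2 * samples, **law)
    return gain_width, gain_data


def limiting_resource(width, samples, **law):
    """Return 'width' or 'data', whichever doubling helps more."""
    gain_width, gain_data = doubling_gains(width, samples, **law)
    return "width" if gain_width > gain_data else "data"


if __name__ == "__main__":
    # A wide model on little data, then a narrow model on plenty of data.
    for width, samples in ((10_000, 10), (4, 10_000)):
        print(width, samples, limiting_resource(width, samples))
```

Under the assumed law, the first configuration gains far more from extra data and the second from extra width, which is the kind of distinction that reporting only total parameters or FLOPs would hide.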
Key Takeaways
- Generalization in quadratic neural networks follows a joint scaling law in which width and data size interact rather than contributing independently.
- Two distinct regimes exist: data-limited (improve with more samples) and width-limited (improve with more parameters), with optimal allocation depending on target error.
- Practitioners should jointly optimize the width-to-data ratio rather than scaling either dimension in isolation.
- This work provides a theoretical foundation for understanding when and why overparameterization helps generalization, with direct implications for resource-efficient model design.