Towards Engineering Scaling Laws with Pretraining Data Composition
arXiv:2606.19781v1 (cross-listed). Abstract: Neural scaling laws describe how model performance improves as a power law in compute, model size, and dataset size. While well-established for large language models, these relationships are emerging for large models in particle physics. As with...
Scaling Laws Cross Disciplines: What Particle Physics Teaches Us About Pretraining Data Composition
A new arXiv preprint (2606.19781v1) extends neural scaling laws beyond language models into particle physics, specifically examining how pretraining data composition affects model performance. The research builds on established findings that model performance follows power-law relationships with compute, model size, and dataset size, and applies these principles to large models trained on particle collider data.
What Happened
The researchers investigated whether scaling laws observed in large language models (LLMs) hold for transformer-based models trained on high-energy physics (HEP) data. Crucially, they moved beyond simple scaling with total dataset size to examine how the composition of pretraining data (the mix of different types of particle collision events) influences downstream performance. This departs from typical scaling law studies, which often treat data as a homogeneous resource.
The work suggests that for physics applications, the distribution of training examples across different physical processes matters as much as raw data volume. This introduces a new variable into scaling law formulations: data composition ratios.
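As a rough illustration of what such a formulation could look like, the block below writes a generic saturating power law in dataset size and a composition-aware variant. This is a sketch under stated assumptions, not the paper's own parameterization (which is not reproduced here): the symbols $L_{\infty}$, $A$, $\alpha$, and the mixture weights $\mathbf{w}$ are all introduced for illustration.

```latex
% Generic saturating power law in dataset size D (illustrative form).
% L_inf is the irreducible loss; A and alpha are fitted constants.
L(D) = L_{\infty} + A \, D^{-\alpha}

% Hypothetical composition-aware variant: the fitted constants become
% functions of the mixture weights w_k, the fraction of pretraining
% data drawn from physical process k.
L(D, \mathbf{w}) = L_{\infty}(\mathbf{w}) + A(\mathbf{w}) \, D^{-\alpha(\mathbf{w})},
\qquad \sum_{k} w_k = 1, \quad w_k \ge 0
```

Under this reading, choosing a data mixture amounts to choosing $\mathbf{w}$ to minimize $L(D, \mathbf{w})$ at the dataset size one can afford.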
Why It Matters
This research has three major implications:
First, it validates that scaling laws are not unique to natural language processing. If similar power-law relationships govern model performance in particle physics, it suggests a deeper universality in how neural networks learn from structured data. This strengthens the case for treating scaling laws as fundamental properties of deep learning rather than quirks of specific domains.
Second, it introduces data composition as a controllable scaling parameter. In LLM development, practitioners often rely on heuristic data mixing strategies (e.g., balancing code, books, and web text). This paper provides a formal framework for optimizing those mixtures, potentially replacing intuition with mathematical guidance.
Third, it highlights an emerging challenge: as models grow, the marginal value of additional data depends heavily on what data you already have. For particle physics, adding more of the most common collision types yields diminishing returns compared to including rare but informative events.
Implications for AI Practitioners
For those training large models, this work suggests several actionable insights:
- Data composition deserves scaling law treatment. Practitioners should treat data mixing ratios as hyperparameters to be optimized, not fixed by convention. Running controlled experiments on mixture proportions at small scale may predict optimal compositions at large scale.
- Domain-specific scaling laws may differ from NLP benchmarks. The power-law exponents found in particle physics may not match those in language modeling. Teams should derive their own scaling coefficients rather than relying on published LLM results.
- Rare data can be disproportionately valuable. In physics, rare collision events carry high information density. The same likely applies to other domains: edge cases, anomalies, and underrepresented patterns may contribute more per sample than common examples.
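The first two points above can be sketched in a few lines of Python: fit a separate power law for each candidate mixture using small-scale runs, then extrapolate each fit to a larger target scale and compare. This is a minimal sketch, not the paper's method. It assumes a pure power law with no irreducible-loss offset, and every number in it (dataset sizes, rare-event fractions, the "true" coefficients, the noise level) is synthetic and chosen only for illustration.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-alpha) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)  # (a, alpha)


def predict(a, alpha, size):
    """Evaluate the fitted power law at a given dataset size."""
    return a * size ** (-alpha)


# Hypothetical small-scale experiment: four dataset sizes per mixture.
sizes = np.array([1e4, 3e4, 1e5, 3e5])

# Synthetic ground truth, keyed by the fraction of rare events in the
# mixture. These (a, alpha) pairs are made up, not taken from any paper.
true_params = {0.0: (20.0, 0.20), 0.1: (60.0, 0.30), 0.3: (80.0, 0.27)}

rng = np.random.default_rng(0)
fits = {}
for frac, (a, alpha) in true_params.items():
    # Simulate measured losses with ~1% multiplicative noise.
    noise = np.exp(rng.normal(0.0, 0.01, sizes.size))
    losses = a * sizes ** (-alpha) * noise
    fits[frac] = fit_power_law(sizes, losses)

# Extrapolate each mixture's fit to a much larger target scale and
# pick the mixture with the lowest predicted loss.
target_size = 1e8
best = min(fits, key=lambda f: predict(*fits[f], target_size))

for frac, (a_hat, alpha_hat) in fits.items():
    print(f"rare fraction {frac}: a={a_hat:.1f}, alpha={alpha_hat:.3f}, "
          f"predicted loss at target={predict(a_hat, alpha_hat, target_size):.3f}")
print("best mixture at target scale:", best)
```

The exponents are fitted per mixture rather than borrowed from published LLM results, which is the point of the second bullet. In practice, extrapolating over several orders of magnitude from a handful of noisy runs is fragile, so predictions like these are best checked against at least one intermediate-scale run.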
Key Takeaways
- Neural scaling laws extend beyond language models to particle physics, with data composition emerging as a critical new variable alongside compute, model size, and dataset size
- Pretraining data mixture ratios can be formally optimized using scaling law frameworks, potentially replacing heuristic approaches to data selection
- The marginal value of additional data depends heavily on composition, with rare or informative examples often contributing more per sample than abundant ones
- Practitioners should derive domain-specific scaling coefficients rather than assuming universal applicability of NLP-derived scaling laws