Research, 2026-07-03

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

Originally published by arXiv cs.AI

arXiv:2607.02266v1. Announce type: cross. Abstract: Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat embedding clusters,...

The Labeling Bottleneck in Data Mixing

A new paper, HERMES, tackles a fundamental but often overlooked problem in AI training: how to label and organize the massive, heterogeneous datasets that go into modern language models. Current approaches to data mixing—deciding what proportions of different data types to include—rely on coarse, pre-existing groupings like "Wikipedia," "Reddit," or "books." These provenance-based labels are blunt instruments. A single Wikipedia article might contain code snippets, mathematical proofs, and historical narratives, yet it gets lumped into one bucket. HERMES proposes a more granular, systematic substrate for labeling data mixtures.

The core innovation is multi-granularity labeling. Instead of assuming a fixed taxonomy (e.g., topic or format), HERMES generates labels at multiple levels of abstraction simultaneously—from broad domains down to fine-grained semantic clusters. This allows a data mixer to weigh "all scientific text" at one level while also distinguishing "physics proofs" from "biology abstracts" at another. The paper demonstrates that this layered approach enables more precise control over training mixtures than flat clustering or single-taxonomy systems.
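To make the idea of weighting at several levels concrete, here is a minimal sketch in Python. It is not the paper's method: the label paths, the multipliers, and the rule of multiplying weights along a label path are all assumptions made for illustration, as are the names doc_weight and sampling_probs.

```python
# Hypothetical documents, each tagged with a label path from coarse to fine.
# The labels are illustrative and are not taken from the HERMES paper.
docs = [
    {"id": "d1", "labels": ("science", "physics proofs")},
    {"id": "d2", "labels": ("science", "physics proofs")},
    {"id": "d3", "labels": ("science", "biology abstracts")},
    {"id": "d4", "labels": ("code", "python scripts")},
]

# Hypothetical multipliers. A mixer can set one at any level of the hierarchy:
# a coarse one for all science text, a finer one for physics proofs only.
weights = {
    ("science",): 2.0,
    ("science", "physics proofs"): 1.5,
}


def doc_weight(labels, weights):
    """Multiply the weights of every prefix of the label path (default 1.0)."""
    w = 1.0
    for depth in range(1, len(labels) + 1):
        w *= weights.get(labels[:depth], 1.0)
    return w


def sampling_probs(docs, weights):
    """Normalize per-document weights into sampling probabilities."""
    raw = {d["id"]: doc_weight(d["labels"], weights) for d in docs}
    total = sum(raw.values())
    return {doc_id: w / total for doc_id, w in raw.items()}


probs = sampling_probs(docs, weights)
for doc_id, p in probs.items():
    print(doc_id, round(p, 3))
```

With these assumed multipliers, each physics-proof document is sampled three times as often as the code document and the biology abstract twice as often, which a single flat label set could only express by enumerating every fine-grained group separately.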

Why This Matters

The significance lies in the hidden cost of poor data labels. Most data-mixing research assumes the hard work of partitioning is already done. In practice, that partitioning is often arbitrary and lossy. If you mix data based on "source" alone, you might over-represent certain writing styles or under-represent rare but valuable knowledge domains without noticing. HERMES addresses this by making the labeling itself a first-class component of the mixing pipeline.

For AI practitioners, this shifts the conversation from "how much data from each source?" to "what semantic and structural properties should our training mixture have?" That is a more powerful framing. It also has implications for data deduplication and quality filtering—if you can label at multiple granularities, you can remove redundant content without losing diversity.
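One way to picture the redundancy point is a per-label cap: thin out crowded fine-grained groups while guaranteeing that every group keeps at least one document. The sketch below is an assumption-laden illustration, not something described in the paper. The cap stands in for a real near-duplicate detector, and the function name cap_per_fine_label and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents with (coarse, fine) label paths; not data from the paper.
docs = [
    {"id": "a", "labels": ("science", "physics proofs")},
    {"id": "b", "labels": ("science", "physics proofs")},
    {"id": "c", "labels": ("science", "physics proofs")},
    {"id": "d", "labels": ("science", "biology abstracts")},
    {"id": "e", "labels": ("history", "biographies")},
]


def cap_per_fine_label(docs, max_per_label):
    """Keep at most max_per_label documents per full label path, in input order."""
    if max_per_label < 1:
        raise ValueError("max_per_label must be at least 1")
    counts = defaultdict(int)
    kept = []
    for doc in docs:
        key = doc["labels"]
        if counts[key] < max_per_label:
            counts[key] += 1
            kept.append(doc)
    return kept


kept = cap_per_fine_label(docs, max_per_label=2)
print([d["id"] for d in kept])
```

Because the cap is applied per fine-grained label rather than per source, the over-represented group shrinks while the rare groups survive intact.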

Implications for Practitioners

First, expect a move toward richer metadata pipelines. Training teams will need to invest in automated labeling systems that can produce hierarchical annotations, not just flat tags. Second, the paper suggests that optimal data mixtures may be more complex than simple scaling laws imply—the structure of labels matters as much as the volume of data. Third, this approach could improve reproducibility: if mixture recipes are defined by granular labels rather than opaque source lists, other teams can reconstruct similar training distributions.
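The reproducibility point can be sketched as a recipe keyed by label paths rather than by source names. The recipe format, the label strings, and the helper allocate_budget below are hypothetical and only illustrate how such a recipe might be turned into per-label token budgets.

```python
def allocate_budget(recipe, total_tokens):
    """Split a token budget across label paths according to target fractions."""
    if abs(sum(recipe.values()) - 1.0) > 1e-9:
        raise ValueError("recipe fractions must sum to 1")
    # Rounding can leave the per-label budgets a few tokens off the total.
    return {label: round(frac * total_tokens) for label, frac in recipe.items()}


# Hypothetical recipe: target share of training tokens per label path.
recipe = {
    "science/physics proofs": 0.25,
    "science/biology abstracts": 0.25,
    "code/python scripts": 0.5,
}

budget = allocate_budget(recipe, 1_000_000)
for label, tokens in budget.items():
    print(label, tokens)
```

A team with a different corpus could, in principle, apply the same recipe once its own documents carry comparable labels, which is harder to do with a list of source names.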

The main challenge is computational cost. Multi-granularity labeling requires clustering and classification at scale, which adds overhead to an already expensive pre-training process. However, as models grow and data budgets tighten, the marginal benefit of smarter mixing likely outweighs the labeling cost.

Key Takeaways

HERMES introduces a multi-granularity labeling system that replaces coarse provenance-based data grouping with layered semantic labels, enabling more precise control over training mixtures.
The approach addresses a critical blind spot: most data mixing research assumes pre-existing partitions are optimal, but those partitions often obscure important data properties.
For practitioners, this means investing in richer metadata pipelines and rethinking mixture design from "source proportions" to "semantic composition."
The trade-off is increased upfront labeling cost, but the potential gains in training efficiency and model quality make this a worthwhile direction for frontier AI labs.
