Research 2026-05-08
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
Source: arXiv cs.AI
arXiv:2605.05227v1 Announce Type: cross
Abstract: Data curation is a critical yet under-explored area in large language model (LLM) training. Existing methods, such as data selection and mixing, operate in an offline paradigm, detaching themselves from training. This separation introduces...