Research · 2026-04-17
How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
Source: Arxiv CS.AI
arXiv:2604.13977v1 (Announce Type: cross)
Abstract: Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled...
Tags: arxiv, papers, prompting