BeClaude
Research · 2026-04-17

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Source: Arxiv CS.AI

arXiv:2604.13977v1 (Announce Type: cross)

Abstract: Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled...

Tags: arxiv, papers, prompting