Research | 2026-04-23
Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection
Source: arXiv cs.AI
arXiv:2604.20549v1 | Announce Type: cross | Abstract: As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio through quality filtering. However, for many languages, native high-quality data is insufficient to train robust...
arxiv, papers