Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR
arXiv:2606.25178v2. Abstract: Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is...
Multi-Domain RLVR: The Next Frontier in Reinforcement Learning
A new preprint on arXiv (2606.25178v2) tackles an open question in reinforcement learning with verifiable rewards (RLVR): how to train a model effectively across multiple reasoning domains at once. The researchers propose an automated curriculum for multi-domain RLVR, focusing on how often each domain should be sampled during training rather than on the single-domain setups behind many recent reasoning models.
What Happened
RLVR uses objective, verifiable rewards (e.g., correct math answers, passing test cases) rather than human feedback. As the abstract notes, it has already been extended from isolated domains like mathematics or programming to multi-domain reasoning suites that also cover science. The paper's contribution is the training curriculum: an automated scheme that dynamically adjusts how often each domain (math, programming, science) is sampled during training. Instead of fixed proportions or human-designed schedules, the curriculum adapts to the model's ongoing performance across domains, prioritizing areas where improvement is most needed.
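To make the idea of adaptive domain sampling concrete, here is a minimal Python sketch. It is not the paper's algorithm, whose details are not given in the abstract excerpt; it assumes a simple rule in which domains with lower recent accuracy are sampled more often. The class name `AdaptiveDomainSampler` and the parameters `window`, `temperature`, and `floor` are hypothetical choices made for illustration.

```python
import math
import random
from collections import deque


class AdaptiveDomainSampler:
    """Illustrative sampler (an assumption, not the paper's method):
    domains with a higher recent failure rate get a higher sampling probability."""

    def __init__(self, domains, window=100, temperature=1.0, floor=0.05, seed=0):
        self.domains = list(domains)
        # Keep only the most recent `window` verifiable rewards per domain.
        self.history = {d: deque(maxlen=window) for d in self.domains}
        self.temperature = temperature
        # Minimum probability per domain; requires floor * len(domains) <= 1.
        self.floor = floor
        self.rng = random.Random(seed)

    def record(self, domain, reward):
        """Store a verifiable reward in [0, 1], e.g. 1.0 for a correct answer."""
        self.history[domain].append(float(reward))

    def accuracy(self, domain):
        """Mean recent reward; a domain with no data is treated as weakest."""
        rewards = self.history[domain]
        return sum(rewards) / len(rewards) if rewards else 0.0

    def probabilities(self):
        """Softmax over failure rates, mixed with a uniform floor."""
        scores = [(1.0 - self.accuracy(d)) / self.temperature for d in self.domains]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # subtract max for stability
        total = sum(exps)
        n = len(self.domains)
        # The floor keeps strong domains in the mix so they are not starved.
        mixed = [(1.0 - self.floor * n) * (e / total) + self.floor for e in exps]
        return dict(zip(self.domains, mixed))

    def sample(self):
        """Draw the domain for the next training prompt."""
        probs = self.probabilities()
        weights = [probs[d] for d in self.domains]
        return self.rng.choices(self.domains, weights=weights, k=1)[0]
```

In a training loop, one would call `sample()` to pick the domain of the next batch and `record()` after the verifier scores each rollout. The uniform floor reflects the "maintaining strengths" idea: without it, a domain the model already handles well could stop being sampled and quietly regress.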
Why It Matters
This work addresses a limitation of current RLVR practice. Much of the progress in leading reasoning models (e.g., OpenAI's o1, DeepSeek-R1) has been demonstrated in a narrow set of domains, with very strong results in math or code but less evidence of cross-domain generalization. The multi-domain RLVR framework is significant for three reasons:
- Transferability: By training on diverse reasoning tasks simultaneously, models can learn general reasoning heuristics that transfer across domains—a property that single-domain training often fails to produce.
- Curriculum Efficiency: The automated approach eliminates the need for costly human trial-and-error in curriculum design. It mirrors how humans learn: focusing on weaknesses while maintaining strengths.
- Scalability: As AI systems are deployed in increasingly heterogeneous environments (e.g., scientific research assistants that must handle math, code, and domain-specific logic), multi-domain training becomes essential.
For those building or fine-tuning reasoning models, this research suggests several actionable insights:
- Curriculum design can matter as much as architecture: The paper implies that how you sequence and mix training data can be as impactful as model architecture choices. Practitioners should invest in adaptive sampling strategies rather than static domain mixes.
- Evaluation must be multi-domain: If you’re only testing on math benchmarks, you may miss catastrophic forgetting or poor generalization in other domains. A balanced evaluation suite is now table stakes.
- RLVR is maturing: The shift from single-domain to multi-domain RLVR signals that the field is moving beyond proof-of-concept demonstrations toward practical, general-purpose reasoning systems.
- Compute allocation becomes strategic: Automated curricula introduce a new lever for compute efficiency—spending more resources on domains where the model is currently weak, rather than training uniformly.
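The multi-domain evaluation point in the list above can be sketched as a per-domain accuracy report plus a check for regressions between two checkpoints. This is a minimal illustration under stated assumptions: the function names `per_domain_accuracy` and `find_regressions` and the `tolerance` threshold are hypothetical, and the scores in the usage note are made-up numbers, not results from the paper.

```python
from typing import Dict, List


def per_domain_accuracy(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Map each domain to the fraction of verifier-approved answers."""
    return {
        domain: (sum(outcomes) / len(outcomes) if outcomes else 0.0)
        for domain, outcomes in results.items()
    }


def find_regressions(
    before: Dict[str, float],
    after: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Return the accuracy change for every domain that dropped by more than
    `tolerance` between two checkpoints (a possible sign of forgetting)."""
    return {
        domain: after[domain] - before[domain]
        for domain in before
        if domain in after and before[domain] - after[domain] > tolerance
    }
```

For example, with hypothetical scores `before = {"math": 0.60, "science": 0.55}` and `after = {"math": 0.80, "science": 0.40}`, `find_regressions(before, after)` flags only `science`, a drop that a math-only benchmark would never surface.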
Key Takeaways
- Multi-domain RLVR with an automated curriculum targets general reasoning transfer, which single-domain training often fails to produce.
- Dynamic curriculum adaptation is proposed as a replacement for fixed domain sampling, reducing the need to hand-tune domain mixing ratios.
- Practitioners should adopt multi-domain evaluation suites and consider adaptive training schedules for reasoning models.
- This work marks a maturation of RLVR from narrow benchmarks toward general-purpose reasoning systems.