Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging
arXiv:2606.18521v1. Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful post-training paradigm that surpasses Supervised Fine-Tuning (SFT) in eliciting reasoning intelligence and resisting catastrophic forgetting. Recent studies further...
What Happened
A new arXiv preprint (2606.18521) investigates a fundamental challenge in Reinforcement Learning with Verifiable Reward (RLVR) models: the "sparsity curse" in parameter space. The research examines how RLVR-trained models—which have shown superior reasoning capabilities and resistance to catastrophic forgetting compared to Supervised Fine-Tuning (SFT)—behave when subjected to model merging techniques. The study reveals that RLVR models occupy a surprisingly sparse and fragmented parameter subspace, making naive merging approaches (like weight averaging or task vector arithmetic) significantly less effective than they are for SFT models.
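For readers unfamiliar with the two "naive" merging approaches named above, the following is a minimal sketch in plain Python. It is not taken from the preprint: the `Params` type (a dictionary of flat float lists standing in for a model state dict) and the function names `average_weights`, `task_vector`, and `apply_task_vectors` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Params = Dict[str, List[float]]


def average_weights(models: List[Params]) -> Params:
    """Uniform weight averaging: element-wise mean across models."""
    n = len(models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


def task_vector(finetuned: Params, base: Params) -> Params:
    """Task vector: fine-tuned weights minus the shared base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_task_vectors(
    base: Params, vectors: List[Params], scale: float = 1.0
) -> Params:
    """Task vector arithmetic: add scaled task vectors back onto the base."""
    merged = {name: list(vals) for name, vals in base.items()}
    for tv in vectors:
        for name in merged:
            merged[name] = [
                m + scale * d for m, d in zip(merged[name], tv[name])
            ]
    return merged
```

Both operations are purely element-wise in parameter space, which is why they implicitly assume that the region between (or around) the merged models is itself well-behaved.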
The authors demonstrate that the parameter configurations learned during RLVR training are not smoothly distributed but instead form isolated "islands" of high performance. When merging two RLVR models, the resulting interpolated parameters often fall into low-performance valleys between these islands, degrading both reasoning accuracy and reward alignment.
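The "islands and valleys" picture can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the authors' experiment: `toy_loss` is a made-up landscape with one Gaussian-shaped basin per model, and `loss_barrier` measures how much worse the linear interpolation path gets compared with its endpoints. All names and the `width` parameter are hypothetical.

```python
import math
from typing import Callable, List


def interpolate(theta_a: List[float], theta_b: List[float], alpha: float) -> List[float]:
    """Linear interpolation between two parameter vectors."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def toy_loss(theta: List[float], centers: List[List[float]], width: float) -> float:
    """Toy landscape: loss is 0 at any basin center and approaches 1 far away.

    Small `width` gives narrow, isolated basins; large `width` gives broad ones.
    """
    best = max(
        math.exp(-sum((t - c) ** 2 for t, c in zip(theta, center)) / (2 * width ** 2))
        for center in centers
    )
    return 1.0 - best


def loss_barrier(
    theta_a: List[float],
    theta_b: List[float],
    loss_fn: Callable[[List[float]], float],
    steps: int = 11,
) -> float:
    """Highest loss on the interpolation path minus the mean endpoint loss."""
    alphas = [i / (steps - 1) for i in range(steps)]
    losses = [loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas]
    endpoint_mean = 0.5 * (losses[0] + losses[-1])
    return max(losses) - endpoint_mean
```

With two basin centers a fixed distance apart, a small `width` yields a barrier close to 1 (the midpoint lies in a valley), while a large `width` yields a barrier near 0 (the basins overlap). In a real setting `loss_fn` would be replaced by an evaluation of the merged model on held-out tasks.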
Why It Matters
This finding has profound implications for the practical deployment of RLVR models. The sparsity curse suggests that the very mechanism enabling RLVR's superior reasoning—its ability to explore and settle into sharp, narrow optima during reinforcement learning—creates a trade-off: these models are brittle to parameter-space interpolation.
For AI practitioners, this means that common workflows built around model merging (e.g., combining domain-specific experts, creating multi-task models, or performing model soups for robustness) may fail unexpectedly with RLVR models. The paper effectively warns that the "merge-and-fine-tune" paradigm that works well for SFT models cannot be directly transferred to the RLVR regime.
Additionally, this sparsity explains why RLVR models often exhibit sudden performance cliffs rather than graceful degradation when modified, a phenomenon many practitioners have observed anecdotally without a theoretical explanation.
Implications for AI Practitioners
- Merging strategies need rethinking: Practitioners should not expect linear interpolation or simple task vector addition to work for RLVR models. The research suggests that more sophisticated merging techniques (such as regularization-aware interpolation, gradient-based alignment, or iterative merging with intermediate RLVR steps) may be necessary.
- Evaluation protocols must adapt: When deploying RLVR models, teams should test not just final performance but also sensitivity to parameter perturbations. A model that scores 90% on reasoning benchmarks but collapses under a 1% weight perturbation is operationally fragile.
- Training pipeline design: The sparsity curse may be mitigated by intentionally broadening the parameter space during RLVR training, for example by adding noise, using wider networks, or employing regularization that encourages flatter minima. This trades off some peak performance for mergeability and robustness.
- Monitoring for catastrophic interference: When combining RLVR models, practitioners should monitor for sudden drops in verifiable reward metrics, as these indicate the merged model has fallen outside the sparse high-performance subspace.
Key Takeaways
- RLVR models occupy sparse, isolated parameter subspaces, making naive model merging significantly less effective than for SFT models
- The sparsity curse explains observed performance cliffs and brittleness in RLVR models when modified or combined
- Practitioners must adopt specialized merging techniques (e.g., regularization-aware interpolation) rather than relying on standard task vector arithmetic
- Training strategies that encourage flatter minima may improve mergeability at the cost of some peak reasoning performance
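
The perturbation-sensitivity check recommended under Implications for AI Practitioners could be sketched roughly as follows. This is an illustrative assumption rather than a protocol from the preprint: noise is Gaussian with standard deviation equal to `rel_scale` times the RMS weight magnitude, and `make_toy_score` is a hypothetical stand-in for a real benchmark evaluation.

```python
import math
import random
from typing import Callable, List


def perturb(theta: List[float], rel_scale: float, rng: random.Random) -> List[float]:
    """Add Gaussian noise whose std is `rel_scale` times the RMS weight size."""
    rms = math.sqrt(sum(t * t for t in theta) / len(theta))
    return [t + rng.gauss(0.0, rel_scale * rms) for t in theta]


def mean_score_drop(
    theta: List[float],
    score_fn: Callable[[List[float]], float],
    rel_scale: float = 0.01,  # 0.01 corresponds to a "1% weight perturbation"
    trials: int = 20,
    seed: int = 0,
) -> float:
    """Average drop in score across several random perturbations."""
    rng = random.Random(seed)
    baseline = score_fn(theta)
    drops = [
        baseline - score_fn(perturb(theta, rel_scale, rng))
        for _ in range(trials)
    ]
    return sum(drops) / trials


def make_toy_score(optimum: List[float], width: float) -> Callable[[List[float]], float]:
    """Hypothetical score: 1.0 at `optimum`, decaying with distance.

    Small `width` mimics a sharp optimum; large `width` mimics a flat one.
    """
    def score(theta: List[float]) -> float:
        dist_sq = sum((t - o) ** 2 for t, o in zip(theta, optimum))
        return math.exp(-dist_sq / (2 * width ** 2))

    return score
```

On this toy score, a sharp optimum loses most of its score under a 1% perturbation while a flat optimum barely moves, which is the kind of fragility gap such a check is meant to surface. With a real model, `score_fn` would load the perturbed weights and run the reasoning benchmark.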