SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR
arXiv:2606.18487v1. Abstract: The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within-group advantage variance is $p(1-p)(g-1)/g$; when early GRPO...
The Hidden Danger of Overtrained SFT in Reinforcement Learning Pipelines
A new paper on arXiv (2606.18487v1) describes a failure mode in how practitioners select supervised fine-tuning (SFT) checkpoints before applying reinforcement learning with verifiable rewards (RLVR). The paper argues that the common practice of picking the SFT checkpoint with the highest pass@1 can backfire, producing "rank inversion": a checkpoint that ranks higher on SFT metrics ends up ranking lower after RLVR training.
The core mechanism is entropy collapse. Overtrained SFT compresses the model's rollout distribution, so the model becomes overconfident in a narrow set of outputs. For binary rewards, the expected within-group advantage variance is p(1-p)(g-1)/g, taking p as the probability that a single rollout on a given prompt earns the reward and g as the number of rollouts per group. This variance matters because it determines how much signal GRPO (Group Relative Policy Optimization) has for separating good responses from bad ones: the expression is zero when p is 0 or 1, since every rollout in the group then receives the same reward and no advantage can be computed. When overtrained SFT pushes per-prompt success rates toward those extremes, the RLVR signal weakens and the model can plateau or even regress.
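To make the formula concrete, here is a minimal Python sketch that evaluates p(1-p)(g-1)/g and checks it against a small simulation. It rests on assumptions not stated in the summary: rewards are modeled as independent Bernoulli(p) draws, advantages are mean-centered but not divided by the group standard deviation, and the variance is the population form (divided by g). The function names are illustrative, not taken from the paper.

```python
import random


def expected_advantage_variance(p: float, g: int) -> float:
    """Closed form p(1-p)(g-1)/g for binary rewards in a group of size g."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if g < 1:
        raise ValueError("g must be at least 1")
    return p * (1.0 - p) * (g - 1) / g


def simulated_advantage_variance(p: float, g: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity.

    Assumption: each rollout's reward is an independent Bernoulli(p) draw,
    and the advantage is the reward minus the group mean.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        rewards = [1.0 if rng.random() < p else 0.0 for _ in range(g)]
        mean = sum(rewards) / g
        # Population variance of the centered rewards within this group.
        total += sum((r - mean) ** 2 for r in rewards) / g
    return total / trials


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99, 1.0):
        closed = expected_advantage_variance(p, g=8)
        approx = simulated_advantage_variance(p, g=8)
        print(f"p={p:<5} closed_form={closed:.5f} simulated={approx:.5f}")
```

Under these assumptions the variance peaks at p = 0.5 (0.21875 for g = 8) and falls to zero as p approaches 0 or 1, which is the regime where a group of rollouts carries no learning signal.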
Why This Matters
This finding challenges a deeply ingrained heuristic in the AI training community: that a better SFT checkpoint always yields better final performance. Many teams invest heavily in maximizing SFT accuracy, only to find that RLVR fine-tuning yields diminishing returns. The paper offers a mathematical explanation: overtrained SFT models have less "exploratory slack" in their output distributions, which makes them resistant to the reward-driven shaping that RLVR provides.
The implications are particularly acute for labs competing on reasoning benchmarks. If your SFT checkpoint is too polished, you may be inadvertently sabotaging your RLVR phase. The research suggests that the best SFT checkpoint for RLVR may score lower on traditional metrics while retaining higher entropy in its output distribution.
Practical Implications for AI Practitioners
First, teams should re-evaluate their SFT checkpoint selection criteria. Instead of solely maximizing pass@1, consider metrics that measure distributional diversity or entropy. Second, the paper provides a theoretical framework for predicting when rank inversion will occur: practitioners can use the variance formula to estimate whether their SFT model is "too compressed" for effective RLVR. Third, this work suggests that early stopping during SFT might be more beneficial than previously thought, especially when RLVR is the intended next step.
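One way such a check might look in practice is sketched below in Python. This is not the paper's selection procedure: it simply averages the variance formula over per-prompt success rates and compares the result with pass@1. The per-prompt rates for the two checkpoints are made-up illustrative numbers, the names `checkpoint_signal`, `early`, and `late` are hypothetical, and the group size g = 8 is an arbitrary choice.

```python
from typing import Sequence, Tuple


def expected_advantage_variance(p: float, g: int) -> float:
    """Closed form p(1-p)(g-1)/g for binary rewards in a group of size g."""
    return p * (1.0 - p) * (g - 1) / g


def checkpoint_signal(pass_rates: Sequence[float], g: int) -> Tuple[float, float]:
    """Return (pass@1, mean predicted advantage variance) for one checkpoint.

    pass_rates holds one estimated success probability per prompt, for
    example the fraction of sampled rollouts that passed the verifier.
    """
    n = len(pass_rates)
    pass_at_1 = sum(pass_rates) / n
    mean_variance = sum(expected_advantage_variance(p, g) for p in pass_rates) / n
    return pass_at_1, mean_variance


# Hypothetical per-prompt success rates, invented for illustration only.
early = [0.5, 0.6, 0.4, 0.5]    # diffuse: every prompt is sometimes solved
late = [1.0, 1.0, 0.0, 0.25]    # compressed: most prompts are all-or-nothing

early_pass, early_var = checkpoint_signal(early, g=8)
late_pass, late_var = checkpoint_signal(late, g=8)

if __name__ == "__main__":
    print(f"early: pass@1={early_pass:.4f} mean_variance={early_var:.4f}")
    print(f"late:  pass@1={late_pass:.4f} mean_variance={late_var:.4f}")
```

In this toy example the later checkpoint has the higher pass@1 but a much lower mean predicted variance, because three of its four prompts sit at p = 0 or p = 1 and contribute nothing. Whether that gap actually translates into rank inversion depends on factors the sketch ignores, so it is best read as a screening heuristic rather than a predictor.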
Key Takeaways
- SFT overtraining can cause rank inversion: Selecting the highest pass@1 SFT checkpoint may lead to worse RLVR outcomes due to entropy collapse in the rollout distribution.
- Distributional diversity matters: RLVR success depends on the within-group advantage variance, p(1-p)(g-1)/g, not just on absolute accuracy.
- Practitioners need new selection criteria: Teams should incorporate entropy or distributional spread metrics when choosing SFT checkpoints for RLVR pipelines.
- Early stopping may be optimal: The best SFT checkpoint for RLVR might be one that appears weaker on traditional benchmarks but preserves output diversity.