Which Tokens Matter? Adaptive Token Selection for RLVR with the Relative Surprisal Index
arXiv:2606.31575v1. Abstract: Reinforcement learning (RL) has become a powerful tool for propelling Large Language Models (LLMs) beyond imitation-based training towards more robust reasoning capabilities. Among existing approaches, RL with Verifiable Rewards (RLVR) has emerged as...
A Smarter Way to Reward Reasoning
A new arXiv preprint (2606.31575v1) tackles a fundamental inefficiency in how reinforcement learning (RL) is applied to large language models (LLMs). The core problem is straightforward: when training an LLM with RL using verifiable rewards (RLVR), not all tokens in a generated sequence are equally informative for learning. Many tokens, such as filler words, repeated phrases, or boilerplate reasoning steps, contribute little signal but still consume compute and can even introduce noise into the reward signal. The authors propose a solution called the Relative Surprisal Index (RSI), an adaptive token selection mechanism that dynamically identifies which tokens actually matter for reward computation.
The key insight is that tokens carrying high "surprisal"—meaning they are statistically unexpected given the preceding context—tend to be the ones where the model is making meaningful reasoning decisions. A token that is highly predictable (low surprisal) is likely just executing a routine pattern, not exploring a new reasoning path. By filtering training to focus on high-surprisal tokens, RSI reduces the effective sequence length for reward computation without sacrificing learning quality. The paper reports that this approach maintains or improves RLVR performance while significantly cutting computational overhead.
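For reference, surprisal in the standard information-theoretic sense is the negative log-probability the model assigns to the token it produced. This summary does not give the paper's exact RSI formula, so the thresholded mask below is only an illustrative assumption: the threshold $\tau$ and the mask $m_t$ are hypothetical symbols introduced here, not notation from the paper.

```latex
s_t = -\log \pi_\theta\left(y_t \mid x,\, y_{<t}\right),
\qquad
m_t = \mathbb{1}\left[\, s_t \ge \tau \,\right]
```

Here $x$ is the prompt, $y_t$ is the token generated at step $t$, and $\pi_\theta$ is the policy being trained. A token the model predicts with probability close to 1 has $s_t$ close to 0 and would be dropped by such a mask.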
Why This Matters
This research addresses a growing pain point in the LLM alignment community. As models scale and reasoning tasks become more complex, RLVR has emerged as a promising alternative to supervised fine-tuning for instilling robust reasoning. However, the computational cost of RL—especially the repeated sampling and reward evaluation—remains a barrier. Every token in every generated sequence must be processed, even when most tokens are trivial. RSI offers a principled, theoretically grounded way to prune that waste.
For AI practitioners, the implications are practical and immediate. First, training efficiency improves: fewer tokens mean faster gradient updates and fewer GPU hours. Second, reward signals become cleaner by filtering out noise from low-information tokens, which could lead to more stable training curves and better final model performance. Third, the approach is model-agnostic and can be layered on top of existing RLVR pipelines without architectural changes.
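One way to picture "layering on top of an existing pipeline" is a token mask applied to a per-token policy-gradient loss. The following is a minimal sketch under stated assumptions, not the paper's objective: the function name `masked_policy_gradient_loss`, the REINFORCE-style form, and the single sequence-level advantage are all hypothetical choices made for illustration, and plain Python floats stand in for the tensors a real trainer would use.

```python
from typing import Sequence


def masked_policy_gradient_loss(
    log_probs: Sequence[float],
    advantage: float,
    keep_mask: Sequence[bool],
) -> float:
    """REINFORCE-style loss averaged over the selected tokens only.

    log_probs[t] is log pi(y_t | context) for the sampled token, `advantage`
    is a sequence-level advantage derived from the verifiable reward, and
    keep_mask[t] marks the tokens a selector chose to keep (hypothetical).
    """
    if len(log_probs) != len(keep_mask):
        raise ValueError("log_probs and keep_mask must have the same length")
    n_kept = sum(keep_mask)
    if n_kept == 0:
        return 0.0  # nothing selected: this sequence contributes no loss
    # Only kept tokens contribute; dropped tokens are skipped entirely.
    total = sum(-advantage * lp for lp, keep in zip(log_probs, keep_mask) if keep)
    return total / n_kept


# Usage: two confident tokens (log-prob near 0) are masked out.
log_probs = [-0.1, -2.0, -0.05, -1.5]
mask = [False, True, False, True]
loss = masked_policy_gradient_loss(log_probs, advantage=1.0, keep_mask=mask)
```

Note that masking the loss by itself does not skip the forward pass over the dropped tokens, so how much compute is actually saved depends on where in the pipeline the selection is applied.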
Implications for AI Practitioners
- Cost reduction: Teams running RLVR at scale can expect meaningful savings in compute, especially for long-context reasoning tasks where most tokens are predictable.
- Better reasoning models: By focusing reward on genuinely surprising reasoning steps, the model may learn more robust decision-making rather than memorizing surface patterns.
- Implementation simplicity: RSI requires only a lightweight surprisal calculation from the model's own logits—no external classifiers or auxiliary models needed.
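To illustrate how little is needed to get surprisal from logits, here is a minimal sketch in plain Python. It computes the standard surprisal (negative log-probability) of each sampled token and keeps the most surprising fraction. The function names and the `keep_fraction` knob are hypothetical, and the top-fraction rule is a stand-in for illustration; the paper's actual RSI selection rule is not described in this summary.

```python
import math
from typing import List, Sequence


def token_surprisals(
    logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> List[float]:
    """Surprisal -log p(token), in nats, for each sampled token.

    logits[t] holds the unnormalised scores over the vocabulary at step t,
    and token_ids[t] is the token that was actually sampled at that step.
    """
    surprisals = []
    for step_logits, tok in zip(logits, token_ids):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in step_logits))
        surprisals.append(log_z - step_logits[tok])
    return surprisals


def select_high_surprisal(
    surprisals: Sequence[float], keep_fraction: float = 0.5
) -> List[bool]:
    """Boolean mask keeping the top `keep_fraction` of tokens by surprisal.

    `keep_fraction` is a hypothetical knob for this sketch, not a parameter
    taken from the paper.
    """
    if not surprisals:
        return []
    n_keep = max(1, math.ceil(keep_fraction * len(surprisals)))
    # Rank positions from most to least surprising; ties keep the earlier one.
    ranked = sorted(range(len(surprisals)), key=lambda i: -surprisals[i])
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(surprisals))]


# Usage on a toy 3-token vocabulary: step 1 is a uniform guess (most surprising),
# step 2 is a near-certain prediction (least surprising).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
token_ids = [0, 1, 0]
s = token_surprisals(logits, token_ids)
mask = select_high_surprisal(s, keep_fraction=0.5)
```

In a real trainer the same quantity typically falls out of a log-softmax over logits the policy has already computed, which is why no auxiliary model is required.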
Key Takeaways
- The Relative Surprisal Index (RSI) selectively focuses RLVR training on tokens that are statistically unexpected, filtering out low-information tokens.
- This approach reduces computational overhead while maintaining or improving reasoning performance in LLMs.
- RSI is a practical, model-agnostic technique that can be integrated into existing RL training pipelines with minimal engineering effort.
- For AI teams, this represents a concrete path to more efficient and potentially more capable reasoning models without scaling up hardware.