BeClaude
Policy | 2026-06-18

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Source: arXiv cs.AI

arXiv:2606.19236v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order...

A New Approach to Stabilizing Policy Entropy in LLM Reasoning

The paper "STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability" tackles a critical failure mode in reinforcement learning for large language models: entropy collapse. This occurs when the next-token distributions of a model trained with reward-based methods like GRPO (Group Relative Policy Optimization) sharpen too quickly. The model becomes overly confident in its outputs and loses the exploratory diversity that complex reasoning tasks depend on.

What the Research Proposes

The authors identify that standard policy gradient methods assign every token in a generated sequence the same sequence-level advantage. This uniform weighting inadvertently penalizes exploratory tokens, because only the final outcome is rewarded. STARE introduces a token-level reweighting mechanism based on surprisal, the negative log-probability of each sampled token under the current policy. Tokens that are more surprising (i.e., have lower probability under the current policy) receive higher advantage weights, encouraging the model to maintain diversity in its reasoning paths without sacrificing performance.
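
The reweighting idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's formula: the linear weighting rule, the `alpha` strength, and the clipping bounds `w_min` and `w_max` are all hypothetical choices made here, and the sketch applies the same weight regardless of the advantage's sign, which the actual method may handle differently. Only the group-standardized advantage follows the usual GRPO recipe.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def surprisal_weights(token_logprobs, alpha=0.5, w_min=0.2, w_max=2.0, eps=1e-6):
    """Hypothetical weighting rule (an assumption, not the paper's formula).

    Each token's surprisal is -log p(token). The weight grows linearly with
    the token's surprisal relative to the sequence mean, then is clipped.
    Before clipping, the weights average to 1, so the overall advantage
    scale of the sequence is preserved.
    """
    s = [-lp for lp in token_logprobs]
    s_mean = mean(s)
    raw = [1.0 + alpha * (s_t - s_mean) / (s_mean + eps) for s_t in s]
    return [min(w_max, max(w_min, w)) for w in raw]


def reweighted_token_advantages(advantage, token_logprobs, **kwargs):
    """Spread one sequence-level advantage over tokens using surprisal weights."""
    return [advantage * w for w in surprisal_weights(token_logprobs, **kwargs)]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Log-probabilities of three sampled tokens in the first response.
token_logprobs = [math.log(0.9), math.log(0.1), math.log(0.5)]
token_advantages = reweighted_token_advantages(advantages[0], token_logprobs)
print(token_advantages)
```

With uniform weighting, all three tokens would share the value `advantages[0]`. Here the low-probability token (p = 0.1) receives the largest share, and the near-certain token (p = 0.9) the smallest.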

This is a first-order correction to the entropy collapse problem, meaning it directly addresses the gradient dynamics rather than applying post-hoc regularization. The method operates at the granularity of individual tokens. It should add little overhead, because the per-token log-probabilities it needs are already computed for the policy gradient loss, and it integrates naturally with existing GRPO-based pipelines.

Why This Matters

Entropy collapse is not a minor technical nuisance; it is a fundamental barrier to scaling LLM reasoning. When a model's policy entropy drops too quickly, it begins to repeat narrow reasoning patterns, fails to explore alternative solution paths, and becomes brittle when faced with out-of-distribution problems. This is particularly problematic for domains like mathematics, code generation, and scientific reasoning, where multiple valid approaches exist.
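
To make "policy entropy drops" concrete, the toy example below computes the Shannon entropy of a next-token distribution as its logits are scaled up. The four-token vocabulary, the logit values, and the scale factors are made up for illustration; the point is only that a sharper distribution has lower entropy and leaves less probability for alternative tokens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Same ranking of four candidate tokens; a larger scale means a sharper policy.
base_logits = [2.0, 1.0, 0.5, 0.0]
for scale in (0.5, 1.0, 2.0, 4.0):
    probs = softmax([scale * x for x in base_logits])
    print(f"scale={scale:<4} top_p={max(probs):.3f} entropy={entropy(probs):.3f}")
```

As the scale grows, the top token absorbs most of the probability mass and the entropy falls toward zero, which is the single-position analogue of the collapse described above.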

Current mitigation strategies, such as KL regularization or entropy bonuses, are coarse instruments that often trade off exploration for stability. STARE offers a more principled solution by tying the exploration signal directly to the model's own uncertainty, creating a self-regulating feedback loop. The approach is also notable for its simplicity: it does not require additional reward models, human feedback, or architectural changes.
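
The contrast can be written out in generic notation. The symbols below are assumptions introduced for this summary, not the paper's: J_PG is the base policy gradient objective, beta a regularization coefficient, pi_ref a reference policy, A_i the sequence-level advantage of response i, and w an unspecified increasing function of surprisal. The first two lines are the standard entropy-bonus and KL-penalty forms, which add a global term to the objective; the third shows where a token-level reweighting acts instead, namely on the advantage itself.

```latex
\begin{align*}
J_{\text{ent}}(\theta) &= J_{\text{PG}}(\theta)
  + \beta \, \mathbb{E}_{t}\big[ H\big(\pi_\theta(\cdot \mid x, y_{<t})\big) \big] \\
J_{\text{KL}}(\theta) &= J_{\text{PG}}(\theta)
  - \beta \, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) \\
s_{i,t} &= -\log \pi_\theta\big(y_{i,t} \mid x, y_{i,<t}\big),
  \qquad \hat{A}_{i,t} = w(s_{i,t}) \, A_i
\end{align*}
```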

Implications for AI Practitioners

For teams training reasoning models, STARE suggests that token-level dynamics matter more than previously appreciated. Practitioners should consider:

  • Monitoring token-level surprisal during training as an early indicator of entropy collapse, rather than relying solely on aggregate metrics like average entropy.
  • Revisiting advantage calculation in existing GRPO implementations. Uniform token weighting may be silently degrading reasoning diversity.
  • Evaluating reasoning robustness beyond final accuracy. Models trained with STARE may show better performance on adversarial or ambiguous prompts.

The paper also implies that the next frontier in RL for LLMs is not just better reward models, but better credit assignment at the token level. STARE provides a concrete, implementable step in that direction.
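
The monitoring suggestion in the list above could look like the following sketch. The function name `surprisal_summary`, the threshold of 2.0 nats, and the two toy batches are hypothetical; a real pipeline would feed in the log-probabilities of sampled tokens from each training step. The mean surprisal of sampled tokens is a Monte Carlo estimate of policy entropy, while the upper quantile and the high-surprisal fraction show whether any exploratory tokens remain.

```python
import math
from statistics import mean


def surprisal_summary(token_logprobs, high_threshold=2.0):
    """Summarize token-level surprisal (-log p) for one batch of sampled tokens."""
    s = sorted(-lp for lp in token_logprobs)
    n = len(s)
    return {
        "mean": mean(s),                         # estimate of policy entropy
        "p90": s[min(n - 1, int(0.9 * n))],      # upper tail of surprisal
        "high_frac": sum(x > high_threshold for x in s) / n,
    }


# Toy batches: probabilities the policy assigned to the tokens it sampled.
early_step = [math.log(p) for p in (0.5, 0.2, 0.05, 0.9, 0.1)]
late_step = [math.log(p) for p in (0.99, 0.98, 0.95, 0.97, 0.9)]

early = surprisal_summary(early_step)
late = surprisal_summary(late_step)
print("early:", early)
print("late: ", late)
```

A `high_frac` that trends toward zero over training, as in the late batch here, would be one sign that the policy has stopped sampling low-probability tokens, even when an averaged metric changes slowly.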

Key Takeaways

  • STARE addresses entropy collapse in LLM reasoning by reweighting token-level advantages based on model surprisal, preserving exploratory diversity.
  • The method is a first-order correction that integrates with existing GRPO pipelines without requiring additional models or human feedback.
  • Token-level dynamics are critical for reasoning robustness; practitioners should monitor surprisal and revisit uniform advantage weighting.
  • This work points toward a broader shift in RL for LLMs: from outcome-level to token-level credit assignment for more stable and capable reasoning.