Policy | 2026-05-11
Gradient Extrapolation-Based Policy Optimization
Source: arXiv cs.AI
arXiv:2605.06755v1 | Announce Type: cross
Abstract: Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step...
Tags: arxiv, papers