Policy 2026-05-14
Selective Off-Policy Reference Tuning with Plan Guidance
Source: arXiv cs.AI
arXiv:2605.11505v2 (Announce Type: replace)
Abstract: Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from...
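The stall on all-fail prompts can be illustrated with a minimal sketch of group-relative advantage normalization as commonly used in GRPO-style methods. The function name `group_relative_advantages` and the `eps` stabilizer are assumptions made for illustration. This is not SORT's repair update, which the truncated abstract does not specify.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    Hypothetical helper: computes (r - group mean) / (group std + eps),
    the usual group-relative form. eps is an assumed stabilizer that
    avoids division by zero when every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One success among four rollouts: the success gets a positive advantage,
# the failures get negative ones, so the policy gradient has a signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# All four rollouts fail: every reward equals the group mean, so every
# advantage is zero and the update carries no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_failed)
```

With a binary verifiable reward, a prompt on which every rollout fails therefore contributes nothing to the update, which is the gap a repair step for failed groups would target.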