Policy2026-05-14

Selective Off-Policy Reference Tuning with Plan Guidance

arXiv:2605.11505v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from...

Read Original Article on Arxiv CS.AI

arxivpapers