Policy | 2026-05-08

P^2O: Joint Policy and Prompt Optimization

arXiv:2603.21877v3. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) enhances Large Language Model (LLM) reasoning but suffers from advantage collapse on "hard samples" where all rollouts fail. This lack of variance eliminates crucial learning signals....
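
The advantage collapse described in the abstract can be illustrated with a minimal sketch. It assumes a group-normalized advantage of the form (reward - group mean) / (group std + eps), as used in GRPO-style methods; the abstract excerpt does not say which estimator the paper uses, and the function name `group_advantages` and the `eps` value are hypothetical choices for illustration only.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: a GRPO-style estimator; not confirmed by the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]

# A prompt where some rollouts pass the verifier: rewards vary,
# so advantages are non-zero and carry a learning signal.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A "hard sample" where every rollout fails: rewards are identical,
# the variance is zero, and every advantage collapses to zero.
hard = group_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed:", mixed)
print("hard: ", hard)
```

Under this assumption, a group of identical rewards yields all-zero advantages, so the policy gradient for that prompt vanishes regardless of how many rollouts are sampled.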

Source: arXiv cs.AI

arxiv, papers, prompting