Research 2026-05-12
MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation
Source: arXiv cs.AI
arXiv:2511.07833v3
Announce Type: replace-cross
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard recipe for post-training LLMs on reasoning tasks, with Group Relative Policy Optimization (GRPO) emerging as a leading approach. However, GRPO and its variants are...
arxiv, papers