Research · 2026-05-12
A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment
Source: arXiv cs.AI
arXiv:2605.06375v1 Announce Type: cross
Abstract: Large language model (LLM) alignment via reinforcement learning from human feedback (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference...
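The excerpt cuts off before the paper's own Pair-GRPO formulation, but the two ingredients it contrasts are standard. As context only, here is a minimal sketch, assuming the usual GRPO group-normalized advantage and the usual DPO-style Bradley-Terry pairwise loss; all function names are hypothetical, and neither function is the paper's proposed method.

```python
import torch
import torch.nn.functional as F

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standard GRPO advantage: normalize each sampled completion's reward
    against the mean/std of its own group (last dimension).
    Shape: (batch, group_size) -> (batch, group_size)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def pairwise_preference_loss(logp_chosen: torch.Tensor,
                             logp_rejected: torch.Tensor,
                             ref_logp_chosen: torch.Tensor,
                             ref_logp_rejected: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """DPO-style Bradley-Terry loss on (chosen, rejected) pairs: the
    generic pairwise objective whose gradient variance and instability
    the abstract critiques. Inputs are summed sequence log-probs under
    the policy and a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```

A Pair-GRPO-style method would presumably combine the two, e.g. shaping the group-relative advantage with a pairwise preference constraint, but the truncated abstract does not specify how, so that step is left out here.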