Research · 2026-05-12
A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment
Source: arXiv cs.AI
arXiv:2605.06375v1 Announce Type: cross
Abstract: Large language model (LLM) alignment via reinforcement learning from human feedback (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference...
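The excerpt cuts off before the paper's own Pair-GRPO formulation, but the two ingredients it contrasts are standard. As context only, here is a minimal sketch, assuming the usual GRPO group-normalized advantage and the usual DPO-style Bradley-Terry pairwise loss; all function names are hypothetical, and neither function is the paper's proposed method.

```python
import torch
import torch.nn.functional as F

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standard GRPO advantage: normalize each sampled completion's reward
    against the mean/std of its own group (last dimension).
    Shape: (batch, group_size) -> (batch, group_size)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def pairwise_preference_loss(logp_chosen: torch.Tensor,
                             logp_rejected: torch.Tensor,
                             ref_logp_chosen: torch.Tensor,
                             ref_logp_rejected: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """DPO-style Bradley-Terry loss on (chosen, rejected) pairs: the
    generic pairwise objective whose gradient variance and instability
    the abstract critiques. Inputs are summed sequence log-probs under
    the policy and a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```

A Pair-GRPO-style method would presumably combine the two, e.g. shaping the group-relative advantage with a pairwise preference constraint, but the truncated abstract does not specify how, so that step is left out here.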