Policy | 2026-04-28
DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training
Source: arXiv cs.AI
arXiv:2512.03847v2 (announce type: replace-cross)
Abstract: Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training...
arxivpapers