Research · 2026-05-08
Optimal Transport for LLM Reward Modeling from Noisy Preference
Source: arXiv cs.AI
arXiv:2605.06036v1 Announce Type: cross. Abstract: Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world preference datasets are inevitably corrupted by noisy preferences. Conventional training objectives tend to overfit these errors, while existing denoising...
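For context, the "conventional training objective" for reward models is typically the Bradley-Terry pairwise loss, which trusts every preference label equally: a flipped (noisy) label pushes the reward margin in the wrong direction, which is the overfitting failure mode the abstract describes. The sketch below illustrates only that baseline objective under assumed tensor shapes; it is not the paper's optimal-transport method, and the toy reward scores are hypothetical.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard pairwise reward-model objective: -log sigmoid(r_chosen - r_rejected).

    Every preference label is weighted equally, so a mislabeled pair
    drives the reward gap the wrong way and the model can overfit it.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical usage: scalar rewards produced by some reward-model head
# for a batch of 8 (chosen, rejected) response pairs.
r_chosen = torch.randn(8, requires_grad=True)    # scores of preferred responses
r_rejected = torch.randn(8, requires_grad=True)  # scores of rejected responses
loss = bradley_terry_loss(r_chosen, r_rejected)
loss.backward()
```

Under this loss, a flipped label makes the gradient reward the wrong response, so robust variants (such as the denoising approaches the abstract alludes to) reweight or reinterpret unreliable pairs rather than fitting them directly.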