Research, 2026-05-11

Mitigating Cognitive Bias in RLHF by Altering Rationality

Source: arXiv cs.AI

arXiv:2605.06895v1
Announce Type: new

Abstract: How can we make models robust to even imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns scalar values to responses. Because these rewards...
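The reward-model setup the abstract describes, scoring responses from pairwise human preferences, is commonly formalized with a Bradley-Terry objective: the model is trained so that the preferred response's scalar reward exceeds the rejected one's. A minimal sketch of that per-pair loss (the function name is mine, not from the paper):

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    The modeled probability that the chosen response beats the rejected
    one is sigmoid(r_chosen - r_rejected); the loss is its -log.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# A reward model that cleanly separates the pair incurs a lower loss
# than one that barely distinguishes them.
confident = preference_loss(2.0, -1.0)   # large margin in favor of chosen
uncertain = preference_loss(0.1, 0.0)    # near-tie
```

With a zero margin the loss is exactly log 2 (the model is indifferent), which is the usual sanity check for this objective.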

Tags: arxivpapers