Research · 2026-05-01
Debiasing Reward Models via Causally Motivated Inference-Time Intervention
Source: Arxiv CS.AI
arXiv:2604.27495v1 (announce type: cross)

Abstract: Reward models (RMs) play a central role in aligning large language models (LLMs) with human preferences. However, RMs are often sensitive to spurious features such as response length. Existing inference-time approaches for mitigating these biases...
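The abstract mentions sensitivity to spurious features such as response length. As a minimal illustration of what an inference-time length-bias correction can look like (this is a generic sketch, not the method proposed in the paper), one can regress reward scores on response length and keep only the residual:

```python
import numpy as np

def debias_rewards(rewards, lengths):
    """Remove the linear component of reward explained by response length.

    Fits reward ~ a * length + b by least squares, then returns the
    residuals re-centered at the original mean, so rankings no longer
    reflect length alone. A toy stand-in for inference-time debiasing.
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Design matrix with an intercept column.
    A = np.stack([lengths, np.ones_like(lengths)], axis=1)
    coef, *_ = np.linalg.lstsq(A, rewards, rcond=None)
    predicted = A @ coef
    return rewards - predicted + rewards.mean()
```

If rewards are driven purely by length, the corrected scores collapse to a constant, removing the spurious ranking signal while leaving length-independent differences intact.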