Research 2026-05-06
Efficient Preference Poisoning Attack on Offline RLHF
Source: arXiv cs.AI
arXiv:2605.02495v1 Announce Type: cross Abstract: Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attacks. We study label-flip attacks...
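The label-flip threat the abstract describes can be sketched on a toy DPO-style preference dataset: the attacker swaps the "chosen" and "rejected" responses for a fraction of pairs before training. The dataset schema and the `flip_preferences` helper below are illustrative assumptions, not the paper's implementation.

```python
import random

def flip_preferences(dataset, flip_fraction=0.1, seed=0):
    """Label-flip poisoning sketch: swap 'chosen' and 'rejected'
    for a random fraction of preference pairs."""
    rng = random.Random(seed)
    poisoned = [dict(ex) for ex in dataset]  # shallow copy per example
    n_flip = int(len(poisoned) * flip_fraction)
    for i in rng.sample(range(len(poisoned)), n_flip):
        poisoned[i]["chosen"], poisoned[i]["rejected"] = (
            poisoned[i]["rejected"], poisoned[i]["chosen"])
    return poisoned

# Toy preference dataset (hypothetical examples)
data = [{"prompt": f"p{i}", "chosen": f"good{i}", "rejected": f"bad{i}"}
        for i in range(10)]
poisoned = flip_preferences(data, flip_fraction=0.2)
flipped = sum(ex["chosen"].startswith("bad") for ex in poisoned)
print(flipped)
```

A DPO trainer consuming `poisoned` instead of `data` would then optimize toward the flipped preferences; the paper's contribution (per the abstract) is making such attacks efficient, which this sketch does not attempt.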