Research · 2026-04-22

Cat-DPO: Category-Adaptive Safety Alignment

Source: Arxiv CS.AI

arXiv:2604.17299v2 · Announce Type: replace-cross

Abstract: Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods collapse safety into a...
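For context, preference-based alignment methods of this kind typically build on the standard DPO objective (Rafailov et al., 2023), which trains the policy to prefer a chosen response over a rejected one relative to a frozen reference model. The sketch below shows that standard loss only; the truncated abstract does not specify Cat-DPO's category-adaptive mechanism, and the function name, argument names, and `beta` default here are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss. Each argument is a batch of summed
    log-probabilities of the chosen/rejected response under the
    trainable policy or the frozen reference model."""
    # Implicit reward for each response: how far the policy has
    # shifted toward it relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize log-sigmoid of the reward margin, pushing the policy
    # to rank the chosen response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

A category-adaptive variant, as the title suggests, would presumably modulate this objective per request category (e.g. harmful vs. benign), but the details are not recoverable from the abstract shown here.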

Tags: arxiv, papers, safety