Research, 2026-05-14

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

arXiv:2605.11679v2

Abstract: In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing...

Read Original Article on Arxiv CS.AI

arxiv, papers, safety