Research · 2026-05-06
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
Source: Arxiv CS.AI
arXiv:2605.01899v1 Announce Type: new Abstract: The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to...
Tags: arxiv, papers, safety