Research · 2026-05-06
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
Source: Arxiv CS.AI
arXiv:2605.01899v1 Announce Type: new Abstract: The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to...
Tags: arxiv, papers, safety