BeClaude Research
2026-05-06

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

Source: Arxiv CS.AI

arXiv:2605.01899v1 Announce Type: new Abstract: The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to...

Tags: arxiv, papers, safety