Research · 2026-05-12
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
Source: arXiv cs.AI
arXiv:2605.08878v1 (cross-listed). Abstract: Aligned large language models (LLMs) remain vulnerable to jailbreak attacks. Recent mechanistic studies have identified latent features and representation shifts associated with jailbreak success, but they leave a more fundamental question open: why...
Tags: arxiv, papers, safety