Research · 2026-05-12
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
Source: arXiv cs.AI
arXiv:2605.08878v1 (cross-listed). Abstract: Aligned large language models (LLMs) remain vulnerable to jailbreak attacks. Recent mechanistic studies have identified latent features and representation shifts associated with jailbreak success, but they leave a more fundamental question open: why...
Tags: arxiv, papers, safety