Research · 2026-05-07
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Source: arXiv cs.AI
arXiv:2605.02958v1 (Announce Type: cross)
Abstract: Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal...
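The abstract contrasts static refusal vectors, read off the terminal representation, with refusal viewed as a process that unfolds across layers. The Python sketch below is an illustration only. It assumes the common difference-in-means construction for a refusal direction, which is not necessarily the paper's procedure since this excerpt does not give one. All function names and the toy activations are hypothetical. It computes a single static score from the last layer and, for contrast, a per-layer trajectory of projections.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refusal_direction(harmful: Sequence[Sequence[float]],
                      harmless: Sequence[Sequence[float]]) -> Vector:
    """Hypothetical helper: difference-in-means direction (harmful minus
    harmless activations at one layer), normalized to unit length."""
    mu_harmful = mean_vector(harmful)
    mu_harmless = mean_vector(harmless)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide")
    return [d / norm for d in diff]


def project(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of an activation onto a unit-length direction."""
    return sum(a * d for a, d in zip(activation, direction))


def static_score(terminal_activation: Sequence[float],
                 direction: Sequence[float]) -> float:
    """Static view: one projection of the terminal (last-layer) representation."""
    return project(terminal_activation, direction)


def refusal_trajectory(layer_activations: Sequence[Sequence[float]],
                       layer_directions: Sequence[Sequence[float]]) -> List[float]:
    """Dynamic view (assumed form): one projection per layer, each layer
    using its own direction, giving a trajectory rather than a single number."""
    return [project(a, d) for a, d in zip(layer_activations, layer_directions)]


if __name__ == "__main__":
    # Toy 2-D activations at a single layer (hypothetical data).
    d = refusal_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
    print(d)  # [1.0, 0.0]
    print(static_score([5.0, 1.0], d))  # 5.0
    print(refusal_trajectory([[0.5, 2.0], [0.1, 3.0]],
                             [[1.0, 0.0], [0.0, 1.0]]))  # [0.5, 3.0]
```

The static score collapses everything into the final layer. The trajectory keeps the per-layer signal, which is the kind of information a dynamic, sparse account of refusal would need. How the paper actually selects layers or aggregates the trajectory is not stated in this excerpt.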