BeClaude
Research2026-05-07

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Source: Arxiv CS.AI

arXiv:2605.02958v1 Announce Type: cross Abstract: Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome. Using Causal...

arxivpapers