BeClaude
Research
2026-05-05

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

Source: Arxiv CS.AI

arXiv:2605.00123v1 Announce Type: new

Abstract: Safety-trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously...
