Research
2026-04-28
Jailbreaking Frontier Foundation Models Through Intention Deception
Source: arXiv cs.AI
arXiv:2604.24082v1 Announce Type: cross
Abstract: Large (vision-)language models exhibit remarkable capabilities but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between safe and unsafe requests, based on the user's intent. It...