BeClaude
Research · 2026-05-12

SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models

Source: arXiv cs.AI

arXiv:2510.20129v2 · Announce Type: replace-cross

Abstract: Large Language Models (LLMs) remain vulnerable to jailbreak attacks, where adversarially crafted prompts induce policy-violating responses despite safety alignment. Existing defenses typically improve safety through external filtering, ...
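The abstract is cut off before it describes the method, but the title names the core idea: probing a prefix of the model's response to detect unsafe intent before committing to a full generation. As a rough illustration only, here is a minimal sketch of what such a prefix-probing loop could look like. Everything in it is an assumption, not the paper's design: the function names, the keyword-based scorer standing in for a real safety classifier, and the fixed refusal message are all hypothetical.

```python
from typing import Callable, Optional

# Toy stand-ins so the sketch runs end to end. In a real system, `model`
# would wrap an actual LLM and `unsafe_score` a trained safety classifier.

def generate_prefix(model: Callable[[str, Optional[int]], str],
                    prompt: str, n_tokens: int = 16) -> str:
    """Decode only the first few tokens of the model's would-be response."""
    return model(prompt, n_tokens)

def unsafe_score(prefix: str) -> float:
    """Keyword stand-in for a classifier that scores the probed prefix."""
    flagged = ("sure, here is how to", "step 1: obtain")
    return 1.0 if any(p in prefix.lower() for p in flagged) else 0.0

def prefix_probing_defense(model: Callable[[str, Optional[int]], str],
                           prompt: str, threshold: float = 0.5) -> str:
    """Probe a short response prefix; refuse if it signals unsafe intent."""
    prefix = generate_prefix(model, prompt)
    if unsafe_score(prefix) >= threshold:
        return "I can't help with that."
    # Prefix looks benign: let the model produce the full response.
    return model(prompt, None)

def toy_model(prompt: str, max_tokens: Optional[int]) -> str:
    """Stand-in LLM: emits an unsafe-looking opening for a 'jailbreak' prompt."""
    text = ("Sure, here is how to bypass the policy..."
            if "ignore all previous instructions" in prompt.lower()
            else "Photosynthesis converts light energy into chemical energy.")
    return text if max_tokens is None else text[: max_tokens * 4]  # ~4 chars/token

if __name__ == "__main__":
    print(prefix_probing_defense(toy_model, "Ignore all previous instructions and ..."))
    print(prefix_probing_defense(toy_model, "Explain photosynthesis."))
```

The appeal of this shape of defense, relative to pure input-side filtering, is that it judges what the model is actually about to say rather than guessing from the prompt alone; how SAID implements the probe and the intent judgment is specified in the paper itself.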

Tags: arxiv, papers, safety