BeClaude
Research, 2026-04-17

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Source: arXiv cs.AI

arXiv:2604.01473v2

Abstract: Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious...

arxiv, papers