Research, 2026-04-17

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

arXiv:2604.01473v2

Abstract: Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious...

Read the original article on arXiv cs.AI

arxivpapers