BeClaude
Research · 2026-04-27

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Source: arXiv cs.AI

arXiv:2510.21285v4 (announce type: replace)

Abstract: Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire...

Tags: arxiv, papers, reasoning, safety