Research · 2026-04-27
When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
Source: arXiv cs.AI
arXiv:2510.21285v4 (Announce Type: replace)
Abstract: Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire...
Tags: arxiv, papers, reasoning, safety