Research | 2026-05-05
Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails
Source: arXiv cs.AI
arXiv:2603.18280v3 (announce type: replace-cross)
Abstract: Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study...
Tags: arxiv, papers