Research 2026-05-05

Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

arXiv:2603.18280v3

Abstract: Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study...

Read Original Article on Arxiv CS.AI
