Policy | 2026-05-05
How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Source: arXiv cs.AI
arXiv:2604.04385v4 (announce type: replace-cross)
Abstract: We localize the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate...
arxivpapers