Policy2026-05-05

How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

arXiv:2604.04385v4 Announce Type: replace-cross Abstract: We localize the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate...

Read Original Article on Arxiv CS.AI

arxivpapers