Research 2026-04-24

Addressing divergent representations from causal interventions on neural networks

arXiv:2511.04638v5. Abstract: A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create...

Read Original Article on Arxiv CS.AI
