Research · 2026-04-24

Addressing divergent representations from causal interventions on neural networks

Source: arXiv cs.AI

arXiv:2511.04638v5

Abstract: A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create...
