BeClaude
Research 2026-05-12

Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers

Source: arXiv cs.AI

arXiv:2602.01442v3 — Abstract: Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evaluate this assumption across two algorithmic tasks...
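The abstract centers on gradient-based attribution as the method under causal evaluation. As a reminder of the basic technique, here is a minimal gradient-times-input sketch on a toy linear model; the weights and inputs are illustrative values, not the paper's setup or tasks.

```python
import numpy as np

def grad_x_input(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Gradient-times-input attribution for a linear model f(x) = w . x.

    For a linear f, df/dx_i = w_i, so the attribution of feature i
    is simply w_i * x_i (computed analytically here, no autodiff).
    """
    return w * x

# Toy example (illustrative numbers only).
w = np.array([0.5, -2.0, 1.0])
x = np.array([2.0, 1.0, 0.0])
attr = grad_x_input(w, x)

# For a linear model the attributions satisfy "completeness":
# they sum exactly to the model output f(x) = w . x.
print(attr)          # per-feature attributions
print(attr.sum())    # equals w @ x for the linear case
```

For nonlinear networks the gradient is taken at the input via autodiff, and the completeness property above no longer holds exactly, which is one reason gradient scores can diverge from causal importance.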

arxivpapers