BeClaude
Research 2026-05-12

Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers

Source: arXiv cs.AI

arXiv:2602.01442v3 — Abstract: Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evaluate this assumption across two algorithmic tasks...
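The abstract centers on gradient-based attribution as the method under causal evaluation. As a reminder of the basic technique, here is a minimal gradient-times-input sketch on a toy linear model; the weights and inputs are illustrative values, not the paper's setup or tasks.

```python
import numpy as np

def grad_x_input(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Gradient-times-input attribution for a linear model f(x) = w . x.

    For a linear f, df/dx_i = w_i, so the attribution of feature i
    is simply w_i * x_i (computed analytically here, no autodiff).
    """
    return w * x

# Toy example (illustrative numbers only).
w = np.array([0.5, -2.0, 1.0])
x = np.array([2.0, 1.0, 0.0])
attr = grad_x_input(w, x)

# For a linear model the attributions satisfy "completeness":
# they sum exactly to the model output f(x) = w . x.
print(attr)          # per-feature attributions
print(attr.sum())    # equals w @ x for the linear case
```

For nonlinear networks the gradient is taken at the input via autodiff, and the completeness property above no longer holds exactly, which is one reason gradient scores can diverge from causal importance.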

arxivpapers