Research · 2026-05-12
Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers
Source: Arxiv CS.AI
arXiv:2602.01442v3 Announce Type: replace-cross
Abstract: Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evaluate this assumption across two algorithmic tasks...
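The dissociation the abstract describes can be sketched with a toy scalar "model" (an illustration of the general phenomenon, not the paper's tasks or method): when one component acts as a redundant backup for another, gradient-based attribution and causal ablation can disagree about which component matters.

```python
# Toy illustration: two redundant components where gradient attribution
# and causal ablation disagree. All names here are hypothetical.

def model(h1, h2):
    # Component h1 normally carries the signal; h2 is a redundant backup
    # that takes over only when h1 is knocked out (ablated to 0).
    return h1 if h1 > 0.5 else h2

def finite_diff_grad(f, h1, h2, eps=1e-6):
    # Local gradient via finite differences, standing in for backprop.
    g1 = (f(h1 + eps, h2) - f(h1, h2)) / eps
    g2 = (f(h1, h2 + eps) - f(h1, h2)) / eps
    return g1, g2

def ablation_effect(f, h1, h2):
    # Causal importance: change in output when each component is zeroed.
    base = f(h1, h2)
    return base - f(0.0, h2), base - f(h1, 0.0)

g1, g2 = finite_diff_grad(model, 1.0, 1.0)
a1, a2 = ablation_effect(model, 1.0, 1.0)

print(f"gradient attribution: h1={g1:.1f}, h2={g2:.1f}")
print(f"ablation effect:      h1={a1:.1f}, h2={a2:.1f}")
```

Here the gradient attributes everything to h1 and nothing to the backup h2, yet ablating h1 alone changes nothing because h2 silently takes over: a "hidden hero" invisible to gradients.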