Research 2026-05-08
Attributions All the Way Down? The Metagame of Interpretability
Source: arXiv cs.AI
arXiv:2605.06295v1 Announce Type: cross
Abstract: We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the...