Research 2026-05-08

From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

arXiv:2605.06494v1. Abstract: Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or...

Read Original Article on Arxiv CS.AI
