VASAE: Naming SAE Dictionary Directions with Vocabulary-Aligned Anchoring
arXiv:2606.27941v1. Abstract: Sparse autoencoders (SAEs) provide useful decompositions of Transformer residual streams, but their learned features are usually named post hoc rather than directly connected to the Transformer's token vocabulary. We introduce Vocabulary-Aligned...
A new arXiv paper, "VASAE: Naming SAE Dictionary Directions with Vocabulary-Aligned Anchoring," tackles a persistent blind spot in mechanistic interpretability: the disconnect between sparse autoencoder (SAE) features and the token vocabulary that language models actually use.
What Happened
Current SAEs decompose transformer residual streams into interpretable features, but these features are typically labeled post hoc by humans or automated systems, a process that is slow, subjective, and prone to misalignment. The VASAE method proposes a fundamental shift: instead of naming features after the fact, it aligns SAE dictionary directions directly with the model's token vocabulary during training. Because each feature is anchored to a specific vocabulary direction, it has an intrinsic, verifiable connection to the tokens the model actually processes. This is more than a change of naming convention; it restructures how the SAE learns its decomposition.
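To make the idea of anchoring concrete, here is a minimal sketch of what an anchoring penalty could look like. It is not the paper's loss, which the summary above does not specify; it assumes a hypothetical setup in which each anchored feature is assigned one token, and the penalty is the mean of (1 - cosine similarity) between the feature's decoder direction and that token's vocabulary direction. The names `anchor_penalty`, `decoder_dirs`, `vocab_dirs`, and `anchors` are illustrative, and the vectors are made-up toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anchor_penalty(decoder_dirs, vocab_dirs, anchors):
    """Mean (1 - cosine) between each anchored feature and its token direction.

    decoder_dirs: list of SAE decoder vectors, one per feature
    vocab_dirs:   dict mapping token -> vocabulary direction (assumed given)
    anchors:      dict mapping feature index -> token (hypothetical assignment)
    """
    gaps = [1.0 - cosine(decoder_dirs[i], vocab_dirs[tok])
            for i, tok in anchors.items()]
    return sum(gaps) / len(gaps)

# Toy 3-dimensional example with made-up numbers.
vocab_dirs = {"dog": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0]}
decoder_dirs = [
    [2.0, 0.0, 0.0],  # already parallel to "dog": contributes no penalty
    [0.0, 1.0, 1.0],  # 45 degrees away from "cat": contributes a penalty
]
anchors = {0: "dog", 1: "cat"}

penalty = anchor_penalty(decoder_dirs, vocab_dirs, anchors)
print(f"anchor penalty: {penalty:.4f}")
```

In a real training loop a term like this would be added to the usual reconstruction and sparsity objectives with some weight; how VASAE actually balances those terms is not stated in the summary.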
Why It Matters
This work addresses a critical bottleneck for AI safety and interpretability. Current SAE interpretability pipelines often produce "dead" features, which never activate, or "polysemantic" features, which respond to several unrelated concepts; both resist clean labeling. VASAE’s vocabulary-aligned anchoring offers two concrete advantages:
- Verifiability: A feature named "the concept of 'dog'" can be directly traced back to the token embeddings for "dog," "puppy," "canine," etc., providing a ground truth for interpretation.
- Scalability: Automated, vocabulary-grounded naming removes the need for labor-intensive human annotation, enabling large-scale analysis of model internals.
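The verifiability and scalability points above can be illustrated with a small sketch of automated, vocabulary-grounded naming: rank tokens by cosine similarity to a feature's direction and use the top matches as the label. This is an assumption about how such a check could be done, not the paper's procedure; `nearest_tokens` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_tokens(direction, vocab_dirs, k=3):
    """Return the k tokens whose directions are most similar to `direction`."""
    scored = [(cosine(direction, vec), tok) for tok, vec in vocab_dirs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(tok, round(score, 3)) for score, tok in scored[:k]]

# Made-up token directions; a real run would read them from the model.
vocab_dirs = {
    "dog": [1.0, 0.1, 0.0],
    "puppy": [0.9, 0.3, 0.0],
    "canine": [0.8, 0.0, 0.2],
    "car": [0.0, 0.0, 1.0],
}
feature = [1.0, 0.0, 0.0]  # hypothetical SAE dictionary direction

top = nearest_tokens(feature, vocab_dirs, k=3)
for tok, score in top:
    print(tok, score)
```

A label produced this way can be rechecked by anyone holding the same weights, which is the sense in which the naming is verifiable rather than a matter of annotator judgment.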
Implications for AI Practitioners
For researchers and engineers working with open-source models, VASAE could streamline two workflows:
- Model editing: When modifying a model’s behavior (e.g., removing harmful outputs), vocabulary-aligned features provide a clearer target for intervention. You know exactly which token-level representations are being altered.
- Safety monitoring: Features that activate on toxic or biased tokens can be identified and tracked more precisely, enabling real-time oversight of model outputs.
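Both workflows can be sketched in a few lines, under the assumption that a feature direction and per-feature activations are already available. The first function removes a residual vector's component along one feature direction (a standard orthogonal projection, used here as a stand-in for an editing step); the second reports which watched features exceed a threshold. The function names, the watchlist labels, and the threshold are all hypothetical and do not come from the paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def ablate_direction(resid, direction):
    """Remove the component of `resid` that lies along `direction`."""
    scale = dot(resid, direction) / dot(direction, direction)
    return [r - scale * d for r, d in zip(resid, direction)]

def flagged_features(activations, watchlist, threshold=0.5):
    """Return labels of watched features whose activation exceeds `threshold`."""
    return [label for idx, label in watchlist.items()
            if activations[idx] > threshold]

# Editing: project a toy residual vector off a toy feature direction.
resid = [3.0, 4.0, 0.0]
direction = [1.0, 0.0, 0.0]
edited = ablate_direction(resid, direction)

# Monitoring: check toy SAE activations against a hypothetical watchlist.
activations = [0.0, 0.9, 0.2]
watchlist = {1: "flagged feature A", 2: "flagged feature B"}
alerts = flagged_features(activations, watchlist)

print(edited)
print(alerts)
```

With vocabulary-anchored features, the watchlist labels would name tokens directly, so a reviewer can see what an alert refers to without consulting a separate annotation pass.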
Key Takeaways
- VASAE anchors SAE dictionary directions to the model’s token vocabulary, enabling intrinsic, verifiable feature naming rather than post-hoc labeling.
- This method could improve the reliability and scalability of mechanistic interpretability, particularly for safety auditing and model editing.
- The trade-off is a potential bias toward token-level concepts, which may limit the capture of abstract or compositional features.
- Practitioners should test VASAE against their specific interpretability needs, especially when working with tasks requiring high-level reasoning beyond token semantics.