C$^{2}$R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders
arXiv:2606.30609v1. Abstract: Sparse Autoencoders (SAEs) are widely used to interpret large language models by decomposing activations into sparse, human-understandable features, but scaling to large dictionaries exposes fundamental challenges. Systematic studies reveal...
Sparse Autoencoders (SAEs) have become a cornerstone tool for mechanistic interpretability, allowing researchers to isolate individual concepts within a model’s activations. However, a new arXiv paper introducing C$^{2}$R highlights a critical scaling problem: as SAE dictionaries grow larger, they suffer from two pathological behaviors, feature splitting and feature absorption. The authors propose Cross-sample Consistency Regularization to mitigate both.
What Happened
The researchers systematically studied SAEs at scale and identified a destructive dynamic. Feature splitting occurs when a single interpretable concept (e.g., “the word ‘cat’”) gets fragmented across multiple, redundant dictionary features. Feature absorption is the complementary failure: one feature becomes overly dominant, absorbing cases that belong to related concepts and suppressing the features that should fire on them. Both issues degrade the sparsity and interpretability of the learned representations, making SAEs unreliable for understanding large models.
To combat this, the team introduces C$^{2}$R, a regularization technique that enforces consistency across samples. Instead of treating each activation vector independently, C$^{2}$R penalizes the SAE when its feature activations diverge for similar input contexts. This cross-sample constraint encourages the encoder to learn stable, non-redundant features that generalize across the dataset, curbing both splitting and absorption.
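The summary above does not give the paper's exact objective, so the following is only one plausible way to write a cross-sample consistency term, not the authors' formulation. All symbols are assumptions introduced for illustration: $x_i$ is the $i$-th activation in a batch of size $B$, $f_i$ its SAE feature vector, $\hat{x}_i$ its reconstruction, $\lambda$ and $\gamma$ are loss weights, and $\tau$ is a cosine-similarity threshold that decides which pairs of inputs count as "similar".

```latex
% Assumed notation (not taken from the paper):
%   f_i = \mathrm{ReLU}(W_e x_i + b_e), \quad \hat{x}_i = W_d f_i + b_d
\mathcal{L}
  = \frac{1}{B}\sum_{i=1}^{B} \lVert x_i - \hat{x}_i \rVert_2^2
  + \lambda \, \frac{1}{B}\sum_{i=1}^{B} \lVert f_i \rVert_1
  + \gamma \, \frac{\sum_{i \neq j} w_{ij}\, \lVert f_i - f_j \rVert_2^2}
                   {\sum_{i \neq j} w_{ij}},
\qquad
w_{ij} = \mathbb{1}\!\left[\cos(x_i, x_j) \ge \tau\right]
```

The first two terms are the usual reconstruction and sparsity losses of a ReLU SAE. The third term is the hypothetical consistency penalty: it is zero when similar inputs receive identical feature vectors and grows when they are encoded by different features.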
Why It Matters
This research addresses a fundamental bottleneck in mechanistic interpretability. As language models grow, so must the SAE dictionaries used to probe them. If larger dictionaries naturally collapse into splitting or absorption, the entire enterprise of scaling interpretability tools becomes suspect. The C$^{2}$R method is a practical fix that targets how SAEs are trained rather than their architecture.
For the field, this work shifts the conversation from “can we train a large SAE?” to “can we train a good large SAE?” The paper provides empirical evidence that without such regularization, even carefully tuned SAEs degrade at scale. This is a timely warning for labs investing heavily in SAE-based interpretability pipelines.
Implications for AI Practitioners
- Interpretability researchers should consider C$^{2}$R as a new baseline. If you are training SAEs with dictionaries exceeding 10,000 features, you may well be encountering splitting and absorption. Adding cross-sample consistency is reported to improve feature quality without changing the core SAE architecture.
- Safety and alignment teams relying on SAE-derived features for monitoring or steering models must verify that their features are not artifacts of splitting or absorption. A feature that appears to represent “honesty” might actually be a fragmented part of a larger concept, leading to unreliable interventions.
- Engineering workflows will need to adjust training loops. C$^{2}$R introduces a batch-level regularization term, which adds computational overhead. However, the paper suggests this cost is justified by the gains in feature stability and dictionary efficiency.
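To make the training-loop change concrete, here is a minimal NumPy sketch of a batch-level consistency term added to a standard SAE loss. It is an illustration under stated assumptions, not the paper's implementation: the function names (`sae_forward`, `consistency_penalty`, `total_loss`), the cosine-similarity threshold `tau`, the squared-distance penalty, and the coefficient values are all hypothetical choices.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Plain ReLU SAE: returns (feature activations, reconstruction)."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # (B, n_features)
    x_hat = f @ W_dec + b_dec                # (B, d_model)
    return f, x_hat

def consistency_penalty(x, f, tau=0.9, eps=1e-8):
    """Mean squared distance between feature vectors of similar inputs.

    Assumption: two inputs are 'similar' when the cosine similarity of
    their activations is at least tau. The paper may define this differently.
    """
    x_unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    sim = x_unit @ x_unit.T                  # (B, B) pairwise cosine similarity
    mask = sim >= tau
    np.fill_diagonal(mask, False)            # ignore self-pairs
    if not mask.any():
        return 0.0                           # no similar pairs in this batch
    sq_norm = (f ** 2).sum(axis=1)
    dist2 = sq_norm[:, None] + sq_norm[None, :] - 2.0 * (f @ f.T)
    dist2 = np.maximum(dist2, 0.0)           # clip tiny negative rounding errors
    return float(dist2[mask].mean())

def total_loss(x, params, l1_coef=1e-3, cons_coef=1e-2, tau=0.9):
    """Reconstruction + L1 sparsity + hypothetical consistency term."""
    f, x_hat = sae_forward(x, *params)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    cons = consistency_penalty(x, f, tau)
    return float(recon + l1_coef * sparsity + cons_coef * cons)
```

In this sketch the extra cost comes from the two B-by-B matrices (`sim` and `dist2`), so the overhead grows quadratically with batch size. That is one way a batch-level term can add the kind of computational overhead mentioned above.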
Key Takeaways
- Large SAE dictionaries suffer from feature splitting (redundancy) and feature absorption (dominance), which degrade interpretability.
- Cross-sample Consistency Regularization (C$^{2}$R) mitigates these issues by enforcing stable feature activations across similar inputs.
- This method is a practical, architecture-agnostic fix that scales with dictionary size, making it critical for future interpretability research.
- Practitioners should audit existing SAEs for these pathologies and consider adopting C$^{2}$R as a standard training component.
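As a starting point for such an audit, the sketch below shows two simple heuristics in NumPy. Neither comes from the paper, and both function names are hypothetical. The first flags pairs of dictionary features whose decoder directions are nearly parallel, a possible sign of splitting. The second measures how often a feature fires on samples labeled with a concept; a general feature with unexpectedly low recall may be losing cases to another feature, a possible sign of absorption. Both produce candidates for manual inspection, not proof of either pathology.

```python
import numpy as np

def candidate_split_pairs(W_dec, threshold=0.8):
    """Return (i, j, cosine) for feature pairs with near-parallel decoder rows.

    W_dec has shape (n_features, d_model). The threshold is an arbitrary
    illustrative choice, not a value from the paper.
    """
    norms = np.linalg.norm(W_dec, axis=1, keepdims=True)
    unit = W_dec / np.maximum(norms, 1e-12)
    cos = unit @ unit.T
    i_idx, j_idx = np.triu_indices(cos.shape[0], k=1)   # each pair once, i < j
    keep = cos[i_idx, j_idx] >= threshold
    pairs = [(int(i), int(j), float(cos[i, j]))
             for i, j in zip(i_idx[keep], j_idx[keep])]
    return sorted(pairs, key=lambda p: -p[2])           # most similar first

def feature_recall(feature_acts, concept_labels, fire_threshold=0.0):
    """Fraction of concept-positive samples on which the feature fires."""
    positives = np.asarray(concept_labels).astype(bool)
    if not positives.any():
        return float("nan")
    return float((np.asarray(feature_acts)[positives] > fire_threshold).mean())
```

For large dictionaries the full pairwise cosine matrix may not fit in memory, so in practice this check would need to be run in chunks or on a sampled subset of features.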