Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision
arXiv:2606.32038v1 Announce Type: cross Abstract: When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their behavior, using...
The Self-Explanation Paradox: When LMs Learn Introspection, Not Imitation
The new arXiv paper "Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision" tackles a foundational question in interpretability: can language models genuinely introspect, or do they merely learn to mimic plausible-sounding explanations? The researchers trained LMs to generate explanations linking their predictions to specific input features, then tracked whether those explanations faithfully reflected actual behavioral changes when inputs were perturbed.
The core finding is that models trained on self-explanation tasks develop a form of "introspective coupling"—their explanations correlate with genuine behavioral shifts, even when the supervision signal (the training data for explanations) remains fixed. This suggests the models are not simply memorizing explanation templates, but are learning to attend to causally relevant features in a way that aligns with their own decision-making processes.
Why This Matters
This research addresses a critical blind spot in current AI safety and interpretability work. Most explanation methods—from attention visualization to feature attribution—assume a static relationship between model internals and outputs. But LMs are dynamic systems; their behavior changes with training, and explanations must track those changes to be trustworthy. The paper demonstrates that self-explanation training can produce this tracking property, which is a necessary (though not sufficient) condition for faithful introspection.
The distinction between "faithful introspection" and "superficial imitation" is not academic. If models merely parrot human-like explanations without causal grounding, they become dangerous tools for rationalizing errors. A model that says "I predicted 'positive sentiment' because of the word 'great'" but would make the same prediction if "great" were removed is not explaining; it is confabulating. This work provides a methodology for detecting such confabulation by measuring whether explanation-behavior coupling persists through training.
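One way to picture "coupling that persists through training" is to score, at each checkpoint, how well the model's stated explanation overlaps with the features that actually move its prediction, and to compare that with its overlap against the fixed explanation target. The sketch below is an illustration only, not the paper's metric: the keyword-weight "model", the `checkpoints` data, and the Jaccard score are all assumptions made for this example.

```python
from typing import Dict, List, Set


def predict(weights: Dict[str, float], tokens: List[str]) -> str:
    """Toy stand-in for an LM: sums hypothetical per-token weights."""
    score = sum(weights.get(t, 0.0) for t in tokens)
    return "positive" if score > 0 else "negative"


def influential(weights: Dict[str, float], tokens: List[str]) -> Set[str]:
    """Tokens whose removal changes the prediction (leave-one-out ablation)."""
    base = predict(weights, tokens)
    return {
        t
        for i, t in enumerate(tokens)
        if predict(weights, tokens[:i] + tokens[i + 1:]) != base
    }


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap in [0, 1]; two empty sets count as perfect agreement."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def coupling_report(checkpoints, tokens, fixed_target):
    """Per checkpoint: (overlap with current behavior, overlap with fixed target)."""
    rows = []
    for ckpt in checkpoints:
        behavior = influential(ckpt["weights"], tokens)
        rows.append((jaccard(ckpt["explanation"], behavior),
                     jaccard(ckpt["explanation"], fixed_target)))
    return rows


tokens = ["fast", "shipping", "great", "price"]
# Hypothetical checkpoints: behavior drifts from relying on "great" to "fast",
# and the (made-up) explanations drift with it.
checkpoints = [
    {"weights": {"great": 1.0}, "explanation": {"great"}},
    {"weights": {"fast": 1.0}, "explanation": {"fast"}},
]
fixed_target = {"great"}  # the explanation label that never changes

report = coupling_report(checkpoints, tokens, fixed_target)
```

In this toy setup the explanation stays aligned with behavior at both checkpoints while its agreement with the fixed target drops, which is the pattern the article describes. A confabulating model would show the opposite: high agreement with the fixed target and falling agreement with behavior.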
Implications for AI Practitioners
For those building or deploying LLM-based systems, this research offers both a warning and a practical tool. The warning: don't assume that a model's explanations are faithful just because they sound coherent. The tool: you can test explanation faithfulness by perturbing inputs and checking whether the model's stated reasons align with actual behavioral changes. This is analogous to how you would validate a scientific instrument—by introducing known stimuli and measuring response.
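The perturbation test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `toy_sentiment_model` is a hypothetical keyword scorer standing in for a real LM call, and the set of "cited" tokens is assumed to have already been parsed out of the model's explanation.

```python
from typing import Callable, List, Set

Model = Callable[[List[str]], str]


def toy_sentiment_model(tokens: List[str]) -> str:
    """Hypothetical stand-in for an LM: sums fixed lexicon weights."""
    weights = {"great": 2, "good": 1, "boring": -1, "awful": -2}
    score = sum(weights.get(t.lower(), 0) for t in tokens)
    return "positive" if score > 0 else "negative"


def influential_tokens(model: Model, tokens: List[str]) -> Set[str]:
    """Return tokens whose removal changes the prediction (leave-one-out)."""
    baseline = model(tokens)
    changed = set()
    for i, tok in enumerate(tokens):
        ablated = tokens[:i] + tokens[i + 1:]
        if model(ablated) != baseline:
            changed.add(tok)
    return changed


def explanation_is_supported(model: Model, tokens: List[str],
                             cited: Set[str]) -> bool:
    """True only if every token the explanation cites is behaviorally influential."""
    return cited <= influential_tokens(model, tokens)


tokens = "the plot was boring but the acting was great".split()

# An explanation citing "great" is backed by behavior: removing it flips the label.
supported = explanation_is_supported(toy_sentiment_model, tokens, {"great"})
# An explanation citing "acting" is not: removing it leaves the label unchanged.
unsupported = explanation_is_supported(toy_sentiment_model, tokens, {"acting"})
```

Leave-one-out deletion is only one possible perturbation. With a real model, masking or substituting tokens may be preferable, since deleting words can push inputs off-distribution.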
Practitioners should also note the training methodology. The finding that explanations can track behavior as the model changes suggests that explanation faithfulness need not be only a post-hoc evaluation metric; it may also be something that can be actively optimized. This opens the door to training pipelines that explicitly reward models for explanations that remain consistent under input perturbations, a form of self-consistency regularization.
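As a rough sketch of what such a reward might look like, the snippet below scores an explanation by the F1 overlap between the features it cites and the features that perturbation shows to be influential, then folds that score into a loss. This is an assumption for illustration, not the paper's training objective; `faithfulness_reward`, `regularized_loss`, and the weight `lam` are hypothetical names.

```python
from typing import Set


def faithfulness_reward(cited: Set[str], influential: Set[str]) -> float:
    """F1 overlap between cited features and perturbation-influential features."""
    true_pos = len(cited & influential)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(cited)       # cited features that really matter
    recall = true_pos / len(influential)    # influential features that were cited
    return 2 * precision * recall / (precision + recall)


def regularized_loss(task_loss: float, reward: float, lam: float = 0.5) -> float:
    """Add a penalty that shrinks to zero as the explanation becomes faithful."""
    return task_loss + lam * (1.0 - reward)


# The explanation cites one influential word and one irrelevant word.
reward = faithfulness_reward({"great", "plot"}, {"great"})
loss = regularized_loss(task_loss=0.4, reward=reward)
```

Using F1 rather than precision alone penalizes explanations that cite only a subset of the influential features as well as those that cite irrelevant ones. In a real pipeline the reward is not differentiable with respect to the generated text, so it would more plausibly feed a reinforcement-style update than a direct gradient.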
However, the paper does not claim that self-explanation training solves the interpretability problem. Tracking behavior is not the same as understanding it: a model could learn to report which features drive its outputs without representing why they do, a kind of behavioral mimicry of introspection. The key takeaway is that this tracking property is learnable and measurable, which is more than many skeptics assumed possible.
Key Takeaways
- Self-explanation training can produce models whose explanations track actual behavioral changes, not just superficial imitation of human-like reasoning
- The "introspective coupling" property can be measured by perturbing inputs and checking if explanation-behavior alignment persists through training
- Practitioners should treat explanation faithfulness as a testable property, not an assumption—use input perturbation as a validation technique
- While promising, introspective coupling is a necessary condition for faithful introspection, not a complete solution to the interpretability challenge