BeClaude
Research | 2026-07-02

Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations

Originally published by arXiv cs.AI

arXiv:2503.13445v3. Abstract: When asked to explain their decisions, LLMs can often give explanations which sound plausible to humans. But are these explanations faithful, i.e. do they convey the factors actually responsible for the decision? In this work, we analyse...

Large language models are increasingly relied upon not only to generate answers but also to explain why they arrived at those answers. This research, posted to arXiv, tackles a critical and often overlooked problem: the faithfulness of those self-explanations. The core finding is that as models scale up in size and capability, a troubling trade-off emerges between how verbose (detailed) an explanation is and how accurately it reflects the model’s actual decision-making process.

The study systematically analyzes LLMs of varying scales, asking them to produce explanations for their outputs. The researchers then probe whether those explanations are faithful, meaning that they correctly identify the input features or reasoning steps that truly influenced the model’s prediction rather than offering post-hoc rationalizations that sound plausible but are disconnected from the model’s internal computations. The results suggest that larger, more powerful models tend to generate more fluent and elaborate explanations, but these explanations are often less faithful. In contrast, smaller models may produce simpler, less impressive explanations that are, paradoxically, more truthful to their actual reasoning.
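To make the notion of faithfulness concrete, here is a minimal sketch of a deletion-based perturbation check. It is an illustration under stated assumptions, not the procedure used in the paper: `toy_model` is a hypothetical stand-in for an LLM call, and the helper names (`influential_words`, `cited_words`, `faithfulness_report`) are invented for this example. The idea is to delete one input word at a time, record which deletions flip the prediction, and compare those words with the ones the explanation actually cites.

```python
from typing import Callable, Dict, Set, Tuple

# A "model" here is any callable mapping text to (prediction, explanation).
Model = Callable[[str], Tuple[str, str]]


def toy_model(text: str) -> Tuple[str, str]:
    """Hypothetical stand-in for an LLM call (not a real API).

    Its decision depends only on the word "excellent", yet its
    explanation always credits "service": an unfaithful explanation.
    """
    prediction = "positive" if "excellent" in text.lower() else "negative"
    return prediction, "I focused on service."


def _normalize(token: str) -> str:
    """Lowercase a token and strip surrounding punctuation."""
    return token.strip(".,!?\"'").lower()


def influential_words(model: Model, text: str) -> Set[str]:
    """Return the words whose deletion changes the model's prediction."""
    words = text.split()
    baseline, _ = model(text)
    flipped = set()
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        prediction, _ = model(perturbed)
        if prediction != baseline:
            flipped.add(_normalize(word))
    return flipped


def cited_words(model: Model, text: str) -> Set[str]:
    """Return the input words that also appear in the model's explanation."""
    _, explanation = model(text)
    explanation_tokens = {_normalize(t) for t in explanation.split()}
    input_tokens = {_normalize(w) for w in text.split()}
    return input_tokens & explanation_tokens


def faithfulness_report(model: Model, text: str) -> Dict[str, Set[str]]:
    """Compare what influenced the prediction with what the explanation cites."""
    influential = influential_words(model, text)
    cited = cited_words(model, text)
    return {
        "supported": influential & cited,    # cited and actually influential
        "missed": influential - cited,       # influential but never mentioned
        "unsupported": cited - influential,  # mentioned but not influential
    }


report = faithfulness_report(toy_model, "The food was excellent but service was slow")
print(report)
```

For the toy model, the report lists "excellent" as missed and "service" as unsupported, which is the pattern of a plausible-sounding but unfaithful explanation. With a real LLM, simple word matching would likely be too crude, and single-word deletion ignores interactions between words, so treat this as a starting point only.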

Why this matters: The implications are profound for the deployment of LLMs in high-stakes domains. If a model can convincingly explain a medical diagnosis, a legal analysis, or a financial recommendation using a narrative that sounds correct but is actually a fabrication, it creates a dangerous illusion of transparency. This is the "explanation trap": users trust the model more because it sounds like it knows what it’s doing, when in reality the explanation is a sophisticated guess. The research underscores that scale alone does not solve the faithfulness problem; in fact, it may exacerbate it by making the model better at generating convincing falsehoods.

For AI practitioners, this research offers several actionable insights:

First, never treat an LLM’s self-explanation as ground truth. Implementing separate verification mechanisms—such as attention analysis, input perturbation tests, or counterfactual probing—is essential to validate whether the explanation matches the model’s actual behavior.

Second, consider the trade-off between verbosity and faithfulness. When building systems for regulated or safety-critical applications, a model that produces shorter, less eloquent explanations may actually be more trustworthy than a larger model that generates polished but unreliable narratives.

Third, invest in evaluation frameworks that measure faithfulness, not just fluency. Many current benchmarks reward models for generating coherent explanations, but this research highlights the need for metrics that directly test whether the explanation aligns with the model’s internal decision process.
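As one way to turn the third point into a number, the sketch below scores faithfulness across many perturbation trials instead of judging a single explanation. It assumes each trial yields two booleans: whether an intervention on the input changed the model's prediction, and whether the model's explanation mentioned the intervened feature. The phi coefficient (the Pearson correlation of two binary variables) is used here purely as an illustrative choice; it is not presented as the metric defined in the paper, and the trial data is invented.

```python
import math
from typing import Sequence


def phi_coefficient(changed: Sequence[bool], mentioned: Sequence[bool]) -> float:
    """Correlation between 'intervention changed the prediction' and
    'explanation mentioned the intervention' over a set of trials.

    Returns NaN when either variable is constant, since the
    correlation is undefined in that case.
    """
    if len(changed) != len(mentioned):
        raise ValueError("changed and mentioned must have the same length")

    # Build the 2x2 contingency table.
    n11 = n10 = n01 = n00 = 0
    for c, m in zip(changed, mentioned):
        if c and m:
            n11 += 1
        elif c:
            n10 += 1
        elif m:
            n01 += 1
        else:
            n00 += 1

    denominator = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    if denominator == 0:
        return float("nan")
    return (n11 * n00 - n10 * n01) / denominator


# Invented trial outcomes, for illustration only.
changed = [True, True, False, False, True, False]
mentioned = [True, True, False, False, False, False]
score = phi_coefficient(changed, mentioned)
print(round(score, 2))  # 0.71
```

A score near 1 means explanations mention an intervention exactly when it mattered; a score near 0 means mentions carry little information about what drove the prediction. Because the score depends only on behavior under perturbation, a fluent but uninformative explanation gets no credit for style.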

Key Takeaways

  • Larger LLMs produce more fluent but often less faithful self-explanations, creating a dangerous trust illusion.
  • Verbose explanations are not inherently better—scale can amplify the gap between what the model says and what it actually computes.
  • Practitioners must implement external validation methods (e.g., input perturbation, attention analysis) to verify explanation faithfulness.
  • For high-stakes applications, smaller models with simpler, more accurate explanations may be preferable to larger models with polished but unreliable narratives.