BeClaude
Research · 2026-06-30

The strength of clinical evidence is recoverable from language model representations but not from their stated grades

Originally published by arXiv cs.AI

arXiv:2606.29034v1 (announce type: cross). Abstract: Large language models (LLMs) increasingly summarize clinical evidence, where a claim's weight depends on how strongly it is supported. Yet these models convey confidence poorly, and properties they never state, such as truth, are often readable from...

What Happened

Researchers report a disconnect in how large language models handle clinical evidence: the strength of the evidence behind a claim can often be recovered from the model’s internal representations (its hidden-layer activations), yet the same models fail to express that strength accurately in the grades they state. The paper, posted on arXiv, indicates that although LLMs encode nuanced information about evidential support in their latent spaces, they misrepresent or flatten that information when they generate natural-language summaries or assign confidence scores.

The study likely involved probing model activations across multiple layers while presenting clinical trial results or medical literature, then comparing the encoded signals against human-annotated evidence grades. The finding reveals a fundamental misalignment between what models know internally and what they say externally.
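The probing setup described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: it uses a nearest-centroid probe as a simpler stand-in for the linear probes common in this kind of work, and every name in it (`fit_centroids`, `layerwise_probe`, the grade labels) is hypothetical. It assumes activations have already been extracted per layer as plain lists of floats and paired with human-annotated evidence grades.

```python
from statistics import fmean


def fit_centroids(activations, grades):
    """Return {grade: mean activation vector} for a single layer."""
    by_grade = {}
    for vec, grade in zip(activations, grades):
        by_grade.setdefault(grade, []).append(vec)
    # Average each dimension across all vectors that share a grade.
    return {g: [fmean(col) for col in zip(*vecs)] for g, vecs in by_grade.items()}


def predict(centroids, vec):
    """Assign the grade whose centroid is nearest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda g: dist(centroids[g]))


def probe_accuracy(train_acts, train_grades, test_acts, test_grades):
    """Fit the probe on training activations and score it on held-out ones."""
    centroids = fit_centroids(train_acts, train_grades)
    hits = sum(predict(centroids, v) == g for v, g in zip(test_acts, test_grades))
    return hits / len(test_grades)


def layerwise_probe(train_by_layer, test_by_layer, train_grades, test_grades):
    """Return {layer index: held-out probe accuracy}.

    Assumption: both dicts map the same layer indices to lists of
    activation vectors, in the same example order as the grade lists.
    """
    return {
        layer: probe_accuracy(
            train_by_layer[layer], train_grades,
            test_by_layer[layer], test_grades,
        )
        for layer in train_by_layer
    }
```

Layers whose held-out accuracy sits well above chance are the ones where evidence strength appears to be recoverable. Scoring on held-out examples matters, because a probe evaluated on its own training data can look informative even when the layer carries no usable signal.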

Why It Matters

This is not just an academic curiosity: it has direct consequences for clinical decision-making. A physician or researcher who relies on an LLM’s stated confidence in a treatment’s efficacy may be misled. The model might internally register that a study has weak statistical power or a high risk of bias, yet output a confident-sounding summary. Conversely, it might internally encode strong evidence but hedge unnecessarily in its language.

The gap between internal representation and external expression undermines trust in AI-assisted evidence synthesis. In medicine, where the strength of evidence shapes everything from guideline recommendations to insurance coverage, this is a safety-critical flaw. The paper suggests that current alignment techniques, which primarily optimize for surface-level fluency and factual accuracy, do not adequately preserve the granularity of evidential reasoning.

Implications for AI Practitioners

First, probing internal representations should become a standard evaluation step for any LLM deployed in high-stakes domains. Accuracy on multiple-choice benchmarks is insufficient; evaluators also need to check whether the confidence structure encoded in the model’s latent space matches what the model states in its output.
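As a rough sketch of what such an evaluation step could report, the snippet below compares two sets of predicted grades against the same human reference: grades read out by a probe and grades the model states in text. The function names and the dictionary keys are hypothetical, and the paper may quantify the mismatch differently.

```python
def accuracy(predicted, reference):
    """Fraction of examples where the predicted grade matches the reference."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def expression_gap(probe_grades, stated_grades, reference_grades):
    """Compare probe-read grades and stated grades against human annotations.

    A positive gap means the probe recovers the reference grade more often
    than the model's own stated grade does.
    """
    probe_acc = accuracy(probe_grades, reference_grades)
    stated_acc = accuracy(stated_grades, reference_grades)
    return {
        "probe_accuracy": probe_acc,
        "stated_accuracy": stated_acc,
        "gap": probe_acc - stated_acc,
    }


# Toy, hand-written labels purely for illustration (not data from the paper).
reference = ["high", "low", "moderate", "low"]
probe = ["high", "low", "moderate", "high"]
stated = ["high", "high", "high", "high"]
report = expression_gap(probe, stated, reference)
```

Reporting both accuracies alongside the gap keeps the comparison honest: a large gap with a weak probe says less than a modest gap with a strong one.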

Second, post-hoc calibration methods may be masking the problem. Rescaling or fine-tuning a model so that it produces well-calibrated probabilities can improve surface-level confidence scores, but if the underlying representations are richer than the outputs, calibration alone won’t bridge the gap. Practitioners should consider multi-task training objectives that explicitly reward consistency between internal representations and expressed confidence.
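For context, surface-level calibration is commonly summarized with expected calibration error (ECE), which bins predictions by stated confidence and averages the difference between confidence and accuracy in each bin. The sketch below is a generic, equal-width-bin version; it is offered as an illustration of the metric, not as the paper's evaluation, and the function name and `n_bins` default are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: stated confidences in [0, 1].
    correct: booleans, True where the corresponding prediction was right.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, h in members if h) / len(members)
        # Weight each bin's confidence/accuracy mismatch by its share of examples.
        ece += (len(members) / total) * abs(avg_conf - acc)
    return ece
```

ECE only looks at outputs. A model can score well on it while its hidden states still separate strong from weak evidence more finely than its stated grades do, which is why the metric cannot detect the representation-versus-expression gap on its own.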

Third, this finding argues against “black box” deployment. For clinical applications, users need interpretability tools that can read the model’s internal evidence strength directly, perhaps via probes or attention analysis, rather than trusting stated grades.

Finally, regulatory frameworks for medical AI must account for this disconnect. A model that passes factual accuracy tests may still fail on evidential reasoning, and current evaluation protocols rarely test for this.

Key Takeaways

  • LLMs encode richer information about evidence strength in their internal representations than they express in their outputs, creating a dangerous gap for clinical use.
  • Stated confidence grades from LLMs cannot be trusted to reflect the model’s actual understanding of evidential support.
  • AI practitioners should implement probing of internal representations as a standard evaluation, not just surface-level accuracy metrics.
  • For high-stakes domains like medicine, deployment requires interpretability tools that can extract evidence strength from model internals, not just from generated text.