Reported Confidence in LLMs Tracks Commitment More Than Correctness
arXiv:2606.29490v1. Abstract: Confidence is an estimate of the probability that a chosen answer is correct. Verbal confidence reports are widely used as uncertainty measures in large language models, but whether they are best understood as estimates of correctness is unclear. We...
The Confidence Paradox: Why LLMs Sound Certain Even When Wrong
A new arXiv preprint (2606.29490v1) tackles a fundamental question about large language model behavior: what do verbal confidence reports actually measure? The researchers argue that when an LLM says it is “90% confident” in an answer, that number correlates more strongly with how committed the model is to its response than with whether the response is correct. In other words, reported confidence reflects how firmly the model’s own generation process settled on an output, not a calibrated probability of being right.
This distinction matters because the AI industry increasingly relies on verbalized confidence as a safety mechanism. Developers use it to decide when to defer to human judgment, when to flag uncertain outputs, and when to trust automated decisions. If confidence is primarily a measure of commitment, then a model that confidently produces a wrong answer is not merely miscalibrated; it is actively misleading the user about its reliability.
The research suggests that LLMs develop a kind of “conviction” during generation. Once the model has committed to a reasoning path or a token sequence, its reported confidence tends to rise whether or not that path leads to the truth. This is reminiscent of human cognitive biases like the anchoring effect, where initial commitments skew subsequent judgment. For AI systems, this means that early token choices can lock the model into a confident error trajectory that later self-assessment may not reliably correct.
Why This Matters for AI Practitioners
For anyone deploying LLMs in production, this finding has immediate practical implications. First, confidence thresholds are not reliable safety filters on their own. Setting a threshold of 0.9 and assuming that outputs above it are trustworthy could be dangerous if the model is simply confident in its own mistakes. Second, the finding challenges the current practice of using verbalized confidence as a proxy for uncertainty in retrieval-augmented generation (RAG) systems, where the model might confidently hallucinate a citation.
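One way to check whether a threshold such as 0.9 deserves trust is to measure it against labeled outcomes. The sketch below is a minimal illustration, not the preprint’s method. It assumes you have already collected pairs of (stated confidence, was the answer correct) from your own evaluation set; the `Record` type, function names, and toy data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    confidence: float  # verbalized confidence in [0, 1], as reported by the model
    correct: bool      # whether the answer matched a trusted label


def accuracy_above(records: List[Record], threshold: float) -> Optional[float]:
    """Empirical accuracy among answers whose stated confidence >= threshold."""
    kept = [r for r in records if r.confidence >= threshold]
    if not kept:
        return None  # nothing cleared the threshold, so there is nothing to measure
    return sum(r.correct for r in kept) / len(kept)


def expected_calibration_error(records: List[Record], n_bins: int = 10) -> float:
    """Weighted mean gap between stated confidence and accuracy, per confidence bin."""
    bins: List[List[Record]] = [[] for _ in range(n_bins)]
    for r in records:
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    total = len(records)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(r.confidence for r in b) / len(b)
        acc = sum(r.correct for r in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - acc)
    return ece


# Hypothetical toy data: four answers stated at 0.95, only two of them correct.
toy = [
    Record(0.95, True), Record(0.95, False),
    Record(0.95, False), Record(0.95, True),
    Record(0.50, True), Record(0.50, False),
]
print(accuracy_above(toy, 0.9))
print(expected_calibration_error(toy))
```

If the accuracy above the threshold sits well below the threshold itself, the threshold is filtering on something other than correctness, which is the failure mode described above.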
The research also raises questions about alignment techniques. If confidence is tied to commitment, then methods like RLHF that reward confident-sounding outputs may inadvertently reinforce this behavior, making models sound more certain without improving their actual accuracy.
Key Takeaways
- Confidence ≠ correctness: Verbal confidence reports in LLMs primarily track the model’s internal commitment to its chosen answer, not the probability of that answer being factually right.
- Current safety practices need revision: Relying on confidence thresholds for deferral or flagging may create a false sense of security, especially when models are confidently wrong.
- Early token decisions matter: The model’s commitment to a reasoning path early in generation can lock it into confident errors that later self-assessment may not reliably correct.
- Practitioners should seek alternative uncertainty metrics: Rather than relying on verbalized confidence alone, consider using token-level log-probabilities, ensemble disagreement, or semantic entropy to gauge actual uncertainty.
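As a rough illustration of the disagreement-based alternatives in the last takeaway, the sketch below scores a question by how much independently sampled answers differ. It is a simplified stand-in for semantic entropy, not an implementation of it: answers are grouped by exact match after normalization, whereas semantic entropy groups them by meaning. It assumes the samples were drawn separately (for example, by repeated sampling at a nonzero temperature); the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variants count as one answer."""
    return " ".join(answer.lower().split())


def answer_entropy(samples: List[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over normalized answers."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def majority_agreement(samples: List[str]) -> float:
    """Fraction of samples that match the most common normalized answer."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    return counts.most_common(1)[0][1] / len(samples)


# Hypothetical samples for one question: two variants agree, one differs.
samples = ["Paris", " paris ", "Lyon"]
print(answer_entropy(samples))
print(majority_agreement(samples))
```

Zero entropy means every sample agreed; higher values mean the model gives different answers when asked again, a signal that does not depend on what the model says about its own confidence.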