Research 2026-06-30

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

Originally published by Arxiv CS.AI

arXiv:2606.29876v1 Announce Type: cross Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical reasoning graphs,...

This new research from arXiv introduces a stress test for the current generation of large language models (LLMs) in medicine. LLMs now reach 60-70% diagnostic accuracy on complex clinical case benchmarks, a figure that sounds promising, but the authors argue that accuracy alone cannot distinguish stable, clinically grounded reasoning from pattern matching. To probe deeper, they introduce Clinical Reasoning Graphs (CRGs), a structured framework designed to visualize and evaluate the pathway an LLM takes to reach a diagnosis, not just the final answer.

What Happened: Moving Beyond the Final Answer

The core problem the paper identifies is the distinction between competence (getting the right answer) and consistency (using reliable, clinically grounded logic to get there). An LLM might correctly diagnose a case of bacterial meningitis, but its reasoning graph could reveal that it ignored a key negative lab result or fixated on a statistical red herring. The CRG methodology forces models to externalize their reasoning steps, linking symptoms, test results, differential diagnoses, and final conclusions into a formal graph structure.
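To make the idea of a reasoning graph concrete, here is a minimal Python sketch. It is not the paper's schema, which this summary does not describe: the node kinds, the relation names (`supports`, `selected_as`), the `ReasoningGraph` class, and the toy meningitis case are all hypothetical, chosen only to show how a graph can expose evidence that the model never used.

```python
from dataclasses import dataclass, field

# Hypothetical node kinds; the paper's actual schema is not given in this summary.
NODE_KINDS = {"finding", "test_result", "differential", "conclusion"}


@dataclass
class ReasoningGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node kind
    edges: set = field(default_factory=set)    # (source id, relation, target id)

    def add_node(self, node_id: str, kind: str) -> None:
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, source: str, relation: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((source, relation, target))

    def unused_evidence(self) -> list:
        """Return findings and test results that no reasoning step starts from."""
        used = {source for source, _, _ in self.edges}
        return sorted(
            node_id
            for node_id, kind in self.nodes.items()
            if kind in {"finding", "test_result"} and node_id not in used
        )


def build_example() -> ReasoningGraph:
    """Toy case (illustrative only): the model never links the negative lab result."""
    g = ReasoningGraph()
    g.add_node("fever", "finding")
    g.add_node("neck_stiffness", "finding")
    g.add_node("csf_gram_stain_negative", "test_result")
    g.add_node("bacterial_meningitis", "differential")
    g.add_node("dx_bacterial_meningitis", "conclusion")
    g.add_edge("fever", "supports", "bacterial_meningitis")
    g.add_edge("neck_stiffness", "supports", "bacterial_meningitis")
    g.add_edge("bacterial_meningitis", "selected_as", "dx_bacterial_meningitis")
    return g


if __name__ == "__main__":
    # Lists the evidence nodes that the reasoning never touched.
    print(build_example().unused_evidence())
```

In this toy graph the final diagnosis may well be correct, yet the negative lab result has no outgoing edge, which is the kind of gap a final-answer benchmark would not surface.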

By analyzing these graphs, the researchers found a troubling phenomenon: high diagnostic accuracy coexists with low reasoning consistency. A model might get Case A right through sound clinical logic, but get Case B right through a completely spurious correlation. This means that current benchmarks, which only check the final answer, are effectively measuring a mix of genuine medical reasoning and sophisticated pattern matching. The LLM is "competent" in the sense that it produces the correct output, but it is not "consistent" in its clinical reasoning process.

Why It Matters: The Illusion of Reliability

For the healthcare industry, this is a significant red flag. The difference between competence and consistency is the difference between a brilliant but erratic junior resident and a reliable attending physician. In a high-stakes environment like diagnosis, we cannot trust a system that is right 60-70% of the time if we cannot tell which of its correct answers rest on sound logic.

If an LLM misdiagnoses a patient, we need to be able to audit why. A CRG provides that audit trail. Without it, a model that "passes" a benchmark may still be dangerously brittle when faced with an atypical presentation or a subtle clinical nuance. This research suggests that the current wave of medical LLMs may be overfit to the statistical patterns of test questions rather than learning the causal relationships of pathophysiology.

Implications for AI Practitioners

For developers and deployers of LLMs in regulated domains, this paper offers a clear directive: stop optimizing for accuracy alone. The CRG framework provides a template for a new kind of evaluation pipeline.

New Evaluation Metrics: Teams should adopt process-based metrics (e.g., graph edit distance from a gold-standard reasoning path, or consistency scores across multiple runs of the same case) alongside outcome-based accuracy.
Training Data Curation: The findings imply that training data should include not just correct diagnoses, but explicit, step-by-step clinical reasoning chains. Datasets like the ones used to build CRGs could become essential for fine-tuning.
Explainability as a Requirement: For any AI product targeting clinical decision support, a CRG-like output should be a mandatory feature. It turns the LLM from a "black box" into a system whose logic can be reviewed by a human expert.
Regulatory Strategy: This research provides a technical foundation for regulators (like the FDA) to demand more than just benchmark scores. A submission for a diagnostic AI may soon need to include evidence of reasoning consistency, not just final-answer accuracy.
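
The process-based metrics mentioned above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the paper's scoring method: graphs are represented as plain sets of hypothetical `(source, relation, target)` edges, `edge_edit_distance` is a cheap stand-in for graph edit distance that only works when both graphs share node labels, and `run_consistency` uses mean pairwise Jaccard similarity as one possible consistency score.

```python
from itertools import combinations


def edge_edit_distance(edges_a: set, edges_b: set) -> int:
    """Count edge insertions plus deletions needed to turn one edge set into the other.

    Assumes both graphs use the same node labels; this is a simplified proxy,
    not a general graph edit distance.
    """
    return len(edges_a ^ edges_b)


def jaccard(edges_a: set, edges_b: set) -> float:
    """Share of edges that two graphs have in common."""
    union = edges_a | edges_b
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(edges_a & edges_b) / len(union)


def run_consistency(runs: list) -> float:
    """Mean pairwise Jaccard similarity over the edge sets of repeated runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure consistency")
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gold-standard path and two model runs on the same case.
GOLD = {
    ("fever", "supports", "meningitis"),
    ("neck_stiffness", "supports", "meningitis"),
}
RUN_1 = set(GOLD)
RUN_2 = {
    ("fever", "supports", "meningitis"),
    ("headache", "supports", "meningitis"),
}

if __name__ == "__main__":
    print(edge_edit_distance(GOLD, RUN_2))
    print(run_consistency([RUN_1, RUN_2, RUN_1]))
```

Reporting these numbers next to final-answer accuracy separates "right answer" from "same reasoning each time". A production pipeline would also need to normalize node labels before comparing graphs, which this sketch skips.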

Key Takeaways

Accuracy is not enough: LLMs can achieve high diagnostic scores through pattern matching, not genuine clinical reasoning, creating a false sense of reliability.
Clinical Reasoning Graphs offer a new audit tool: This framework allows evaluators to visualize and score the logical pathway an LLM uses, distinguishing competence from consistency.
Practitioners must adopt process-based metrics: Teams building medical AI should evaluate reasoning graphs, not just final answers, to ensure robust and trustworthy performance.
The research sets a precedent for regulation: Expect future compliance requirements for medical AI to include evidence of consistent, auditable reasoning paths, not just benchmark scores.
