Research 2026-06-24

RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring

arXiv:2606.23992v1 (announce type: cross). Abstract: Clinical value sets define the standardized terminology codes used in quality measurement, phenotyping, cohort construction, and clinical decision support. The recently introduced Retrieval-Augmented Set Completion (RASC) benchmark showed that...

The Adjudication Problem in Clinical AI

A new preprint on arXiv introduces RASC+ (Retrieval-Constrained LLM Adjudication), a framework designed to improve how large language models handle the meticulous task of clinical value set authoring. This work builds directly on the earlier RASC (Retrieval-Augmented Set Completion) benchmark, which exposed fundamental limitations in LLMs’ ability to accurately select codes from standardized medical terminologies, such as SNOMED CT, ICD-10, or LOINC, for specific clinical concepts.

The core innovation is straightforward but significant: instead of relying on a single LLM pass or simple retrieval augmentation, RASC+ introduces an adjudication layer. When the model generates a candidate set of codes, a second LLM—or the same model with a constrained prompt—reviews the output against retrieved clinical evidence, flagging inconsistencies, missing codes, or spurious inclusions. This creates a structured verification loop that mimics the peer-review process used in clinical informatics.
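The verification loop described above can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the names `retrieve_evidence`, `adjudicate`, and `author_value_set`, the toy `EVIDENCE` table, and the rule-based stand-in for the adjudicating LLM are all assumptions introduced here, and the codes are placeholders rather than real terminology identifiers.

```python
from dataclasses import dataclass, field

# Hypothetical retrieval corpus: code -> concepts the code is documented to support.
# Placeholder codes, not real SNOMED CT / ICD-10 / LOINC identifiers.
EVIDENCE = {
    "C001": {"type 2 diabetes"},
    "C002": {"type 2 diabetes"},
    "C003": {"gestational diabetes"},
}


@dataclass
class Verdict:
    keep: set = field(default_factory=set)      # candidates supported by evidence
    spurious: set = field(default_factory=set)  # candidates with no supporting evidence
    missing: set = field(default_factory=set)   # supported codes the generator omitted


def retrieve_evidence(concept):
    """Stand-in for retrieval: codes whose evidence mentions the concept."""
    return {code for code, concepts in EVIDENCE.items() if concept in concepts}


def adjudicate(concept, candidates):
    """Stand-in for the adjudicating LLM: compare candidates with retrieved evidence."""
    supported = retrieve_evidence(concept)
    return Verdict(
        keep=candidates & supported,
        spurious=candidates - supported,
        missing=supported - candidates,
    )


def author_value_set(concept, candidates):
    """One generate-then-adjudicate pass: drop spurious codes, add missing ones."""
    verdict = adjudicate(concept, set(candidates))
    return verdict.keep | verdict.missing, verdict


# The (hypothetical) generator proposed one correct code and one spurious code.
final, verdict = author_value_set("type 2 diabetes", ["C001", "C003"])
print(sorted(final), sorted(verdict.spurious), sorted(verdict.missing))
```

In a real system the adjudicator would be a model call rather than a set comparison, and its flags would more plausibly be routed to a human reviewer than applied automatically as they are here.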

Why This Matters for Healthcare AI

Clinical value sets are the unsung infrastructure of modern digital health. They define which codes count as “diabetes,” “heart failure,” or “post-operative complication” in electronic health records. Errors in these sets propagate into quality metrics, population health analytics, and clinical decision support alerts. A patient with gestational diabetes might be incorrectly counted in a diabetes registry; a post-surgical infection could be missed by surveillance algorithms.

The RASC+ approach addresses a persistent pain point: LLMs, despite their broad medical knowledge, struggle with the granular, context-dependent nature of code selection. A model might know that “type 2 diabetes” maps to E11.9 (type 2 diabetes mellitus without complications) in ICD-10-CM, but fail to recognize that the same concept requires multiple SNOMED CT codes when used in a pediatric quality measure. By constraining the LLM with retrieved authoritative sources and then adjudicating its output, RASC+ reduces both false positives (spurious codes that make a set overly broad) and false negatives (codes that should be in the set but are missing).
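To make the false positive and false negative framing concrete, a candidate code set can be scored against an expert-authored reference set using precision and recall. The summary above does not say which metrics the paper reports, so this is a generic sketch; the function name `precision_recall` and the single-letter codes are placeholders introduced for illustration.

```python
def precision_recall(predicted, reference):
    """Score a predicted code set against a reference value set."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    # False positives lower precision; false negatives lower recall.
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall


reference = {"A", "B", "C", "D"}   # expert-authored value set (placeholder codes)
predicted = {"A", "B", "E"}        # model output: one spurious code, two missing

p, r = precision_recall(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here the spurious code `E` is the false positive and the omitted codes `C` and `D` are the false negatives.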

Implications for AI Practitioners

For those building clinical AI systems, this work highlights three practical lessons:

First, retrieval augmentation alone is insufficient. Simply feeding an LLM relevant documents does not guarantee accurate code selection. The adjudication step—explicitly checking outputs against constraints—adds a quality gate that mirrors how human experts work.

Second, the benchmark matters. The RASC benchmark provides a standardized evaluation framework, which is rare in clinical NLP. Practitioners should watch for adoption of this benchmark when evaluating medical coding models.

Third, domain-specific constraints are tractable. The paper demonstrates that clinical coding rules (e.g., “a value set must include all descendants of a parent concept”) can be formalized as constraints that LLMs can check, even if they struggle to generate compliant outputs from scratch.
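The descendant rule quoted in the third lesson is the kind of constraint that can also be checked mechanically. The sketch below assumes a toy parent-to-children hierarchy; the `CHILDREN` table, the concept labels, and the function names are hypothetical and stand in for a real terminology service. It is not taken from the paper.

```python
# Hypothetical toy hierarchy: parent concept -> direct children.
CHILDREN = {
    "diabetes": ["type 1 diabetes", "type 2 diabetes"],
    "type 2 diabetes": ["type 2 diabetes with nephropathy"],
}


def descendants(concept, children=CHILDREN):
    """Return all transitive descendants of a concept (iterative depth-first walk)."""
    found, stack = set(), list(children.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found


def missing_descendants(value_set, parent, children=CHILDREN):
    """Descendants of `parent` that the value set fails to include."""
    return descendants(parent, children) - set(value_set)


value_set = {"diabetes", "type 1 diabetes", "type 2 diabetes"}
gaps = missing_descendants(value_set, "diabetes")
print(gaps)
```

An empty result would mean the value set satisfies the constraint; a non-empty result lists exactly the concepts an adjudicator would need to flag.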

The approach is not without limitations. The adjudication step adds a second model pass, which roughly doubles computational cost and introduces latency, a concern for real-time clinical decision support. Additionally, the framework’s performance depends heavily on the quality of the retrieval corpus; outdated or incomplete terminology sources will undermine the adjudication.

Key Takeaways

RASC+ introduces a structured adjudication layer that verifies LLM-generated clinical code sets against retrieved evidence, reducing errors in value set authoring.
Accurate clinical value sets are critical for quality measurement, phenotyping, and decision support; even small errors propagate into downstream clinical and operational decisions.
For AI practitioners, the work underscores that retrieval augmentation alone is insufficient—explicit verification constraints are needed for high-stakes medical tasks.
The RASC benchmark provides a much-needed standardized evaluation for clinical code generation, and its adoption could accelerate progress in this specialized domain.

Read Original Article on Arxiv CS.AI
