Research 2026-06-30

Primary ICD Category Prediction using LLM-based Probing

Originally published by Arxiv CS.AI

arXiv:2606.28798v1 Announce Type: new Abstract: Objective: ICD codes are central to reimbursement, research, and population health surveillance, yet automated coding systems often struggle to integrate diagnostic signals from both clinical narratives and structured electronic health record (EHR)...

What Happened

Researchers have published a new preprint on using LLM-based probing to predict primary ICD (International Classification of Diseases) categories from clinical data. The study addresses the challenge of integrating diagnostic signals from both unstructured clinical narratives (e.g., physician notes, discharge summaries) and structured electronic health record (EHR) fields. Rather than building a full automated coding system from scratch, the authors employ a "probing" approach: a pre-trained large language model serves as a frozen feature extractor, and lightweight classifiers are trained on top of its representations. Because only the small classifier is trained, the approach can draw on the LLM's semantic understanding of clinical language without fine-tuning the entire model or assembling the massive labeled datasets that fine-tuning usually requires.
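The probing setup described above can be sketched in a few lines. This is a minimal illustration, not the preprint's implementation: `frozen_embed` is a hypothetical stand-in for a frozen LLM encoder (it only buckets tokens by byte sum), the notes and the two category labels are made up, and the probe is an ordinary softmax classifier trained by gradient descent. The point it shows is that the encoder is never updated; only the probe's weights are.

```python
import math

DIM = 8  # toy embedding size; a real LLM yields hundreds or thousands of dimensions

def frozen_embed(text):
    """Hypothetical stand-in for a frozen LLM encoder (never trained here).

    Buckets each token by its byte sum and L2-normalizes the counts. In
    practice this would return pooled hidden states from the pre-trained model.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[sum(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def class_scores(weights, bias, x):
    """Linear scores, one per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_probe(features, labels, n_classes, epochs=500, lr=0.2):
    """Fit a linear softmax probe; these weights are the only trained parameters."""
    weights = [[0.0] * DIM for _ in range(n_classes)]
    bias = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, bias, x))
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # d(cross-entropy)/d(score_c)
                bias[c] -= lr * grad
                for i, xi in enumerate(x):
                    weights[c][i] -= lr * grad * xi
    return weights, bias

def predict(weights, bias, x):
    scores = class_scores(weights, bias, x)
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up notes; label 0 = circulatory, 1 = respiratory (illustrative categories only).
notes = [
    "chest pain with st elevation",
    "acute myocardial infarction",
    "productive cough and fever",
    "pneumonia with wheezing",
]
labels = [0, 0, 1, 1]

features = [frozen_embed(n) for n in notes]  # computed once; the encoder stays fixed
weights, bias = train_probe(features, labels, n_classes=2)
print([predict(weights, bias, x) for x in features])
```

Because the features are computed once and cached, swapping in a different probe (or a different label set) does not require touching the encoder again, which is where most of the compute saving in a probing setup comes from.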

Why It Matters

ICD coding is the backbone of modern healthcare administration. Accurate codes determine hospital reimbursement, fuel epidemiological research, and enable population health surveillance. Yet manual coding is expensive, error-prone, and subject to human fatigue. Automated systems have existed for years, but they typically struggle with two issues: (1) capturing nuanced clinical context from free-text notes, and (2) effectively combining that textual information with structured data like lab results or vital signs. This research matters because it points to a pragmatic middle ground: using LLMs not as end-to-end coders but as probing tools that extract high-quality features. The approach is computationally cheaper than full fine-tuning, which is critical for real-world deployment where hospitals may lack GPU clusters. Additionally, by focusing on primary ICD category prediction (the principal diagnosis for an encounter), the work targets the most consequential coding decision, where errors have the highest financial and clinical impact.

Implications for AI Practitioners

For AI teams working in healthcare NLP, this study offers several actionable insights. First, the probing paradigm suggests that off-the-shelf LLMs encode representations of clinical text rich enough to outperform traditional bag-of-words or RNN-based approaches; the abstract excerpt does not name the models, though ClinicalBERT-style encoders or GPT variants are plausible candidates. Practitioners should consider probing as a baseline strategy before investing in expensive domain-specific fine-tuning. Second, integrating structured EHR data with LLM embeddings is a non-trivial engineering challenge, and the paper’s methodology for fusing these modalities (possibly concatenation or attention mechanisms; the excerpt does not say which) may serve as a template for similar multi-modal clinical tasks. Third, the focus on primary ICD prediction highlights the importance of hierarchical classification: ICD codes exist in a taxonomy, and predicting the top-level category first narrows the candidate set for downstream subcode assignment, which can reduce error propagation. Finally, the computational efficiency of probing means smaller hospitals or research groups with limited resources can still use state-of-the-art language models, broadening access to advanced clinical NLP.
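Two of the ideas above, fusing structured fields with text embeddings and classifying coarse-to-fine, can be illustrated with a short sketch. Everything here is an assumption made for illustration: the preprint's actual fusion method and category definition are not given in the excerpt, so this uses plain concatenation after z-scoring and treats the three-character ICD-10 category as the coarse level. The embeddings, structured values (age, troponin), and code scores are made up, and all function names are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each structured column to zero mean and unit variance,
    so a large-valued field (e.g. age) does not dominate the fused vector."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]  # guard constant columns
    return [[(v - mu) / sd for v, (mu, sd) in zip(row, stats)] for row in rows]

def fuse(text_embedding, structured_features):
    """Early fusion by concatenation: one vector per encounter for the probe."""
    return list(text_embedding) + list(structured_features)

def category_of(code):
    """Three-character ICD-10 category of a full code, e.g. 'I21.4' -> 'I21'."""
    return code.split(".")[0][:3]

def best_code_within(category, code_scores):
    """Coarse-to-fine step: choose the top-scoring full code inside the
    already-predicted category, ignoring codes from other categories."""
    candidates = {c: s for c, s in code_scores.items() if category_of(c) == category}
    return max(candidates, key=candidates.get) if candidates else None

# Made-up inputs: two encounters with 2-d text embeddings and (age, troponin) fields.
text_embeddings = [[0.1, 0.9], [0.8, 0.2]]
structured_rows = [[67.0, 2.4], [34.0, 0.1]]

fused = [fuse(e, s) for e, s in zip(text_embeddings, zscore_columns(structured_rows))]
print(len(fused[0]))  # text dimensions followed by structured dimensions

# Made-up subcode scores; the category is assumed to come from an earlier probe.
code_scores = {"I21.4": 0.41, "I25.10": 0.22, "J18.9": 0.45}
print(best_code_within("I21", code_scores))
```

Concatenation is the simplest fusion choice and a reasonable first baseline; attention-based fusion would replace `fuse` with a learned weighting, at the cost of more parameters to train. In the last call, the highest-scoring code overall belongs to a different category, so restricting candidates to the predicted category changes the outcome, which is the error-containment effect the hierarchical argument relies on.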

Key Takeaways

LLM-based probing offers a computationally efficient alternative to full fine-tuning for clinical coding tasks, making advanced NLP accessible to resource-constrained healthcare settings.
The study tackles the real-world challenge of fusing unstructured clinical narratives with structured EHR data, providing a practical blueprint for multi-modal medical AI.
Focusing on primary ICD category prediction addresses the highest-stakes coding decision, with direct implications for reimbursement accuracy and clinical research validity.
Practitioners should evaluate probing strategies as a strong baseline before committing to more expensive model training pipelines.

