Research | 2026-06-18

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

arXiv:2606.19183v1 (announce type: cross)

Abstract: Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect...

What Happened

A new arXiv preprint (2606.19183v1) proposes a hybrid architecture that reframes the role of large language models in clinical decision support. Instead of treating the LLM as a standalone diagnostic oracle that outputs a diagnosis directly from patient notes, the system delegates the final classification to a traditional machine learning model. The LLM acts as an interface layer: it extracts structured features from free-text clinical documentation, and those features are fed into a dedicated ML classifier trained specifically for pediatric appendicitis detection.
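The two-stage flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the preprint's implementation: the feature names, the keyword-based `extract_features` stand-in for the LLM, and the hand-set logistic weights are all hypothetical and carry no clinical meaning.

```python
import math
import re

# Hypothetical feature names; the preprint's actual feature set is not given here.
FEATURES = ("rlq_pain", "fever", "vomiting")


def extract_features(note: str) -> dict:
    """Stand-in for the LLM interface layer: free text in, structured features out.

    A real system would call an LLM here. Keyword matching keeps the sketch
    runnable, but it cannot handle negation ("no fever"), which is one reason
    a language model is used for this step in the first place.
    """
    text = note.lower()
    return {
        "rlq_pain": bool(re.search(r"right lower quadrant|rlq", text)),
        "fever": "fever" in text,
        "vomiting": "vomit" in text,
    }


# Illustrative weights only: not fitted to data and not clinically meaningful.
WEIGHTS = {"rlq_pain": 1.5, "fever": 0.8, "vomiting": 0.6}
BIAS = -2.0


def classify(features: dict) -> float:
    """Stand-in for the trained ML classifier: a logistic model over the features."""
    z = BIAS + sum(WEIGHTS[name] * float(features[name]) for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


def pipeline(note: str) -> float:
    """Free-text note -> structured features -> probability from the classifier."""
    return classify(extract_features(note))
```

The point of the split is that `classify` never sees the note: it only sees the feature dictionary, so it can be tested and replaced without touching the language-model step.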

The researchers explicitly identify the core failure modes of LLM-only approaches: sensitivity to prompt phrasing, dependence on the order in which information is presented, and a tendency to generate plausible but incorrect reasoning. By constraining the LLM to a feature extraction role, the system retains the flexibility of natural language input while offloading the high-stakes binary classification to a more robust, deterministic model.
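Sensitivity to information order is something an extraction step can be tested for directly. The sketch below is one possible check, not a method taken from the paper: it reorders the sentences of a note and measures how often the extracted features stay the same. Both extractors are hypothetical stand-ins for an LLM call.

```python
import itertools


def extract_features(sentences):
    """Hypothetical stand-in for an LLM extractor (keyword matching, order-independent)."""
    text = " ".join(sentences).lower()
    return {"fever": "fever" in text, "vomiting": "vomit" in text}


def first_sentence_only(sentences):
    """Deliberately order-sensitive extractor, to show what the check catches."""
    return {"fever": "fever" in sentences[0].lower()}


def order_stability(extractor, sentences):
    """Fraction of sentence orderings whose extraction matches the original order."""
    baseline = extractor(sentences)
    orderings = list(itertools.permutations(sentences))
    matches = sum(extractor(list(p)) == baseline for p in orderings)
    return matches / len(orderings)
```

The number of orderings grows factorially with note length, so for realistic notes a random sample of permutations would replace the exhaustive loop.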

Why It Matters

This work addresses a fundamental tension in applied AI: LLMs excel at understanding and transforming unstructured human language, but their output can change with phrasing and context, which makes them a poor fit for tasks that demand consistent, factual answers, especially in medicine. The hybrid approach acknowledges that "intelligence" in clinical AI is not a single capability but a pipeline of distinct competencies.

For healthcare AI, this is a pragmatic correction to the hype cycle. Many recent efforts have attempted to use LLMs as end-to-end diagnostic tools, only to find them brittle in real-world settings where a single misdiagnosis carries serious consequences. The hybrid design mitigates this by separating language understanding, where LLM output can vary from run to run, from classification, where a fixed model returns the same result for the same structured input. The ML model can be rigorously validated, calibrated, and audited, which remains difficult for LLM reasoning chains.
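As an example of what "validated and calibrated" can mean in practice, the sketch below computes two standard calibration measures for a classifier's predicted probabilities: the Brier score and a binned expected calibration error. The function names and the equal-width binning are choices made for this illustration; the preprint's own evaluation protocol is not described here.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=5):
    """Weighted average gap between mean confidence and observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins over [0, 1]; p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - frac_pos)
    return ece
```

Both functions operate only on the classifier's outputs and the true labels, so they can be run on a held-out set without involving the LLM at all.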

More broadly, the "LLM as interface, not oracle" pattern is a reusable architectural insight. It suggests that the most reliable AI systems will not be monolithic models but orchestrated pipelines in which each component handles the task it is best suited to.

Implications for AI Practitioners

First, practitioners should reconsider the default assumption that larger, more capable LLMs should be used for end-to-end tasks. The paper's argument is that an LLM confined to feature extraction and paired with a dedicated classifier can be more reliable than a powerful model asked for the diagnosis directly, which also leaves room for smaller, cheaper LLMs in the extraction step.

Second, this architecture can ease the path to regulatory compliance. In regulated domains like healthcare, the ability to audit the ML classifier independently of the LLM is a significant advantage. The LLM becomes a preprocessing step rather than a decision-maker, which narrows what has to be validated and makes responsibility for each decision easier to trace.
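One way this separation could support auditing is to log the exact structured input the classifier received together with its output, so a decision can be replayed later without re-running the LLM. The record layout and field names below are hypothetical, offered as a sketch rather than a compliance recipe.

```python
import hashlib
import json


def audit_record(features: dict, probability: float, model_version: str) -> dict:
    """Build a log entry that lets the classifier's decision be replayed without the LLM."""
    # Canonical JSON (sorted keys, fixed separators) so equal feature sets hash equally.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return {
        "features": features,
        "features_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "probability": probability,
    }
```

Because the hash is computed over a canonical serialization, two records with the same features in a different key order produce the same digest, which makes duplicate or tampered inputs easier to spot.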

Third, the approach highlights the importance of prompt engineering and output parsing. The reliability of the entire pipeline depends on the LLM consistently extracting the correct features. Practitioners will need to invest in structured output formats (e.g., JSON schemas) and fallback mechanisms for when the LLM produces malformed or missing features.
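A minimal sketch of such a parsing layer follows. It assumes the LLM is asked to return a JSON object; the `SCHEMA` fields are hypothetical examples, and the fallback policy (set invalid or missing features to `None` and report them) is one option among several.

```python
import json

# Hypothetical schema: feature name -> accepted Python type(s).
SCHEMA = {"fever": bool, "vomiting": bool, "wbc_count": (int, float)}


def parse_llm_output(raw: str):
    """Return (features, problems). Invalid or missing fields become None."""
    empty = {name: None for name in SCHEMA}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return empty, ["malformed JSON"]
    if not isinstance(data, dict):
        return empty, ["top-level value is not an object"]

    features, problems = {}, []
    for name, expected in SCHEMA.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_bool = isinstance(value, bool)
        valid = isinstance(value, expected) and (expected is bool or not is_bool)
        if valid:
            features[name] = value
        else:
            features[name] = None
            reason = "missing" if name not in data else "wrong type"
            problems.append(f"{name}: {reason}")
    return features, problems
```

A caller can then decide what a non-empty `problems` list means: retry the extraction, route the case to a human, or pass `None` values to a classifier that was trained to handle missing inputs.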

Finally, this hybrid pattern is domain-agnostic. Any field where unstructured text must be converted into structured inputs for a downstream model—legal document analysis, financial risk assessment, customer support triage—can benefit from this separation of concerns.

Key Takeaways

A hybrid LLM-ML system for pediatric appendicitis uses the LLM only for feature extraction, not diagnosis, avoiding the brittleness of LLM-only clinical decision support.
The architecture separates language understanding from classification, allowing each component to be optimized and validated independently.
This pattern offers a practical path for deploying LLMs in high-stakes, regulated environments by making the decision pipeline auditable and controllable.
AI practitioners should consider this "LLM as interface, not oracle" design for any application requiring both natural language input and reliable, deterministic output.

