Research2026-07-01

Cross-Domain Feature Expansion for Tabular Medical Data via Knowledge Graphs Injection

Originally published byArxiv CS.AI

arXiv:2606.31171v1 Announce Type: new Abstract: Acquiring comprehensive cross-domain biomedical profiles is often costly and time-consuming, resulting in severe data scarcity in medical research. To address this challenge, we propose MedKGTab, a knowledge-injected framework specifically engineered...

What Happened

Researchers have introduced MedKGTab, a novel framework that injects structured knowledge graphs into tabular medical data to overcome the chronic problem of data scarcity in biomedical AI. The core innovation lies in "cross-domain feature expansion"—using external knowledge graphs (such as those linking diseases, drugs, genes, and symptoms) to enrich sparse, high-dimensional patient records. By mapping limited tabular features to broader biomedical ontologies, MedKGTab generates synthetic but biologically plausible feature expansions without requiring additional costly data collection.

The approach is particularly notable because it operates on tabular data, which remains the dominant format in clinical settings—electronic health records, lab results, and registry data are all tabular. Most prior work on knowledge injection has focused on text or graph data, leaving tabular medical data underserved. MedKGTab bridges this gap by treating each patient record as a node that can be connected to an external knowledge graph, then using graph neural networks to propagate information and generate enriched representations.

Why It Matters

Data scarcity in medical AI is not merely a technical inconvenience—it is a structural barrier. Acquiring comprehensive biomedical profiles requires expensive assays, longitudinal studies, and patient consent, all of which limit dataset sizes to hundreds or low thousands of samples. Traditional approaches like transfer learning or data augmentation have limited efficacy on tabular data, which lacks the spatial or temporal structure of images or text.

MedKGTab addresses this bottleneck by leveraging the vast, curated knowledge already encoded in biomedical ontologies (e.g., UMLS, DrugBank, or Gene Ontology). This is fundamentally different from synthetic data generation, which can introduce artifacts. By grounding expansions in established biological relationships, the framework maintains clinical plausibility. Early results suggest significant improvements in downstream tasks such as disease prediction and drug response classification, particularly when training data is extremely limited (e.g., fewer than 500 samples).

For the broader AI field, this work signals that knowledge injection—a technique often associated with large language models—can be effectively adapted to structured, high-stakes domains where data is scarce and errors are costly.

Implications for AI Practitioners

First, practitioners working on clinical decision support systems should evaluate MedKGTab as a preprocessing step before applying standard classifiers. The framework is designed to be model-agnostic, meaning enriched features can feed into XGBoost, random forests, or neural networks without architectural changes.

Second, the approach highlights the value of investing in domain-specific knowledge graph curation. The quality of feature expansion directly depends on the completeness and accuracy of the underlying knowledge graph. Teams should prioritize aligning their tabular features with standardized ontologies (e.g., SNOMED CT, ICD-10) to maximize compatibility.

Third, computational cost remains a consideration. Injecting knowledge graphs adds a graph neural network inference step, which may be prohibitive for real-time applications or resource-constrained environments. Practitioners should benchmark latency against their deployment requirements.

Finally, this work reinforces a broader lesson: in data-scarce medical domains, the most effective path forward may not be collecting more data, but better leveraging the knowledge we already have.

Key Takeaways

MedKGTab injects external biomedical knowledge graphs into tabular medical data to generate biologically plausible feature expansions, addressing data scarcity without costly new data collection.
The framework is model-agnostic and designed for clinical tabular data, a format often neglected by knowledge injection techniques.
Performance gains are most pronounced in extreme data-scarce scenarios (e.g., <500 samples), making it relevant for rare disease research and small-scale clinical studies.
Practitioners must invest in ontology alignment and consider computational overhead before deployment in production environments.

Read Original Article on Arxiv CS.AI

arxivpapers