Research 2026-06-19

Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

arXiv:2606.19345v1. Abstract: The rapid increase in scientific publications leads to the fact that manual study screening in systematic literature reviews (SLRs) is increasingly resource consuming, inefficient, and inconsistent. Classifying studies that clearly report...

Automating Systematic Reviews: How LLM Ensembles Tackle Medical Literature Overload

The arXiv preprint (2606.19345v1) applies ensembles of large language models to identifying EQ-5D health-related quality-of-life studies in PubMed from their abstracts alone. The core idea is not a single model but a combination of several LLMs, with the aim of classifying abstracts more accurately than an individual model would. This addresses a genuine bottleneck: as publication volumes grow, manual screening for systematic literature reviews (SLRs) in medicine becomes increasingly resource-consuming, inefficient, and inconsistent.

Why This Matters Beyond Medical Research

Systematic reviews are the backbone of evidence-based medicine, but their manual screening process is notoriously inefficient. A typical SLR may require reviewers to assess thousands of abstracts by hand, and agreement between reviewers is often imperfect. The application of LLM ensembles here offers three distinct advantages:

First, reliability through redundancy. Ensembles reduce the risk of any single model's biases or blind spots dominating classification decisions. For medical applications where false negatives (missing relevant studies) can have serious consequences, this redundancy is critical.

Second, cost efficiency at scale. Running multiple LLMs increases the computational cost per abstract, but it can replace much of the far greater human cost of manual screening. For large-scale reviews covering tens of thousands of abstracts, the economics favor automation.

Third, reproducibility. Human reviewers fatigue and vary in judgment; an LLM ensemble, once validated, can produce the same classification for the same input, provided model versions are pinned and decoding is deterministic.

Implications for AI Practitioners

This work signals several practical lessons for deploying LLMs in specialized domains:

Domain-specific fine-tuning remains valuable. Generic LLMs struggle with medical terminology and study design nuances. The ensemble approach likely benefits from models fine-tuned on biomedical literature (e.g., PubMedBERT variants) combined with general-purpose models.

Abstract-only classification has limitations. The paper's focus on abstracts is pragmatic, since full-text access is often restricted, but it also means the system cannot capture details buried in methodology sections. Practitioners should set realistic expectations about recall rates.

Ensemble design matters more than model size. The choice of which models to combine, how to weight their outputs, and how to handle disagreements (e.g., requiring consensus vs. majority vote) significantly impacts performance. This is where domain expertise becomes essential.

Validation against human judgment is non-negotiable. Any automated screening system must be benchmarked against gold-standard human reviews before deployment, particularly in clinical contexts where missed studies could affect patient care recommendations.
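
The disagreement-handling options named above (consensus vs. majority vote, plus weighting of model outputs) can be sketched in a few lines. This is an illustrative sketch only, not the paper's method: it assumes each model has already returned a boolean include/exclude decision for an abstract, and every function name here is hypothetical.

```python
from typing import Sequence


def _check(votes: Sequence[bool]) -> None:
    # An empty vote list has no meaningful outcome (note that all([]) is True).
    if not votes:
        raise ValueError("at least one model vote is required")


def majority_vote(votes: Sequence[bool]) -> bool:
    """Include the abstract if strictly more than half of the models say so."""
    _check(votes)
    return 2 * sum(votes) > len(votes)


def consensus_vote(votes: Sequence[bool]) -> bool:
    """Include only if every model agrees (favors precision over recall)."""
    _check(votes)
    return all(votes)


def any_vote(votes: Sequence[bool]) -> bool:
    """Include if at least one model says so (favors recall over precision)."""
    _check(votes)
    return any(votes)


def weighted_vote(
    votes: Sequence[bool], weights: Sequence[float], threshold: float = 0.5
) -> bool:
    """Include if the weight share of 'include' votes reaches the threshold."""
    _check(votes)
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    included = sum(w for v, w in zip(votes, weights) if v)
    return included / total >= threshold


def is_borderline(votes: Sequence[bool]) -> bool:
    """True when the models disagree, a natural trigger for human review."""
    _check(votes)
    return any(votes) and not all(votes)
```

Because missed studies are the costlier error in screening, a recall-oriented rule such as any_vote, or a low threshold in weighted_vote, may be the safer starting point. Which rule actually performs best is an empirical question that only validation data can answer.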

Key Takeaways

LLM ensembles offer a practical path to automating the most labor-intensive phase of systematic literature reviews, reducing both time and human error
The approach works best when combining domain-specific and general-purpose models, rather than relying on any single architecture
Abstract-only classification is a pragmatic trade-off that sacrifices some accuracy for broad applicability across paywalled databases
Practitioners must validate ensemble outputs against human reviewers and establish clear protocols for handling borderline or ambiguous classifications
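
As a rough illustration of what validating against human reviewers can look like, the sketch below compares ensemble decisions with gold-standard human labels and reports recall, precision, and Cohen's kappa. The function name and the boolean label format are assumptions made for this example; the paper's own evaluation protocol may differ.

```python
import math
from typing import Dict, Sequence


def screening_metrics(
    predicted: Sequence[bool], gold: Sequence[bool]
) -> Dict[str, float]:
    """Compare ensemble include/exclude decisions with human gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal length")

    # Confusion-matrix counts, treating 'include' (True) as the positive class.
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    tn = sum(not p and not g for p, g in zip(predicted, gold))
    n = len(gold)

    # Recall: share of truly relevant abstracts the ensemble kept.
    recall = tp / (tp + fn) if (tp + fn) else math.nan
    # Precision: share of kept abstracts that are truly relevant.
    precision = tp / (tp + fp) if (tp + fp) else math.nan

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else math.nan

    return {"recall": recall, "precision": precision, "kappa": kappa}


if __name__ == "__main__":
    # Toy labels invented purely for illustration.
    ensemble = [True, True, False, False, True]
    human = [True, True, False, False, False]
    print(screening_metrics(ensemble, human))
```

In a screening setting, recall is usually the number to watch first, since a missed relevant study cannot be recovered in later review stages.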

