Research | 2026-06-18

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

arXiv:2606.18613v1 (announce type: cross). Abstract: The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance...

The Missing Benchmark: Why PhysAssistBench Matters

A new preprint on arXiv introduces PhysAssistBench, a benchmark designed to evaluate large language models (LLMs) on the integrated task of assisting physicians in clinical workflows. Prior evaluations test isolated capabilities, such as answering medical exam questions, parsing electronic health records (EHRs), or simulating patient conversations. PhysAssistBench instead simulates the full, messy reality of a doctor’s day: interacting with a patient, extracting relevant data from an EHR, and producing actionable clinical assistance. The early results are sobering: even top-performing models like GPT-4 and Claude 3.5 Sonnet achieve success rates well below what would be acceptable in a clinical setting, particularly on tasks requiring multi-step reasoning and context switching.

Why This Matters

The significance of PhysAssistBench lies in its design philosophy. Most current medical AI evaluations are “siloed”—they test a model’s knowledge base (e.g., USMLE-style questions), its ability to retrieve information from structured records, or its conversational fluency. But a physician’s real work is integrative: they must listen to a patient’s narrative, cross-reference that with lab results and medication lists, and then formulate a differential diagnosis or treatment plan—all while managing time constraints and incomplete information. PhysAssistBench replicates this by presenting a scenario, a patient dialogue, and an EHR interface, then requiring the LLM to produce a coherent clinical note or recommendation.

The results highlight a critical gap. Models that excel at single-domain tasks often fail when forced to integrate multiple sources of information. For example, a model might correctly identify a drug interaction from an EHR but fail to connect it to a patient’s reported symptom. This is not a trivial failure—it is precisely the kind of error that could lead to clinical harm. The benchmark thus serves as a reality check for the AI community: we are not yet ready to deploy LLMs as reliable physician assistants in live clinical environments.
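The gap between isolated and integrated performance can be illustrated with a small sketch. It assumes a hypothetical per-episode record with three pass/fail checks; the field names, the all-checks-must-pass scoring rule, and the toy data are illustrative assumptions, not the paper's schema, metric, or results.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Hypothetical per-episode outcome (field names are illustrative)."""
    dialogue_ok: bool   # elicited the relevant symptom from the patient
    ehr_ok: bool        # retrieved the relevant record from the EHR
    reasoning_ok: bool  # connected the two in the final note


def component_accuracy(results, field):
    """Fraction of episodes that pass a single isolated check."""
    return sum(getattr(r, field) for r in results) / len(results)


def integrated_success(results):
    """Fraction of episodes that pass every check at once."""
    passed = sum(r.dialogue_ok and r.ehr_ok and r.reasoning_ok for r in results)
    return passed / len(results)


# Toy data, made up for illustration only: each skill fails on a different episode.
results = [
    EpisodeResult(dialogue_ok=True, ehr_ok=True, reasoning_ok=False),
    EpisodeResult(dialogue_ok=True, ehr_ok=False, reasoning_ok=True),
    EpisodeResult(dialogue_ok=False, ehr_ok=True, reasoning_ok=True),
    EpisodeResult(dialogue_ok=True, ehr_ok=True, reasoning_ok=True),
]

for name in ("dialogue_ok", "ehr_ok", "reasoning_ok"):
    print(name, component_accuracy(results, name))  # 0.75 for each check
print("integrated", integrated_success(results))    # 0.25
```

In this toy case every isolated skill scores 75%, yet only one episode in four succeeds end to end, which is the pattern a siloed evaluation would hide.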

Implications for AI Practitioners

For those building medical AI systems, PhysAssistBench offers several actionable lessons. First, evaluation must be holistic. If your model can pass the USMLE but cannot handle a multi-turn patient interview while referencing a lab report, it is not ready for deployment. Practitioners should adopt benchmarks that mirror the full workflow, not just isolated skills.

Second, context switching is a hard problem. The benchmark reveals that models struggle to maintain coherence when shifting between patient dialogue and structured EHR data. This suggests that architectural improvements—such as better memory mechanisms or explicit context management—are needed, not just larger training datasets.

Third, safety thresholds must be higher. The paper’s findings imply that current LLMs, even with fine-tuning, are not reliable enough for unsupervised clinical assistance. Practitioners should focus on “human-in-the-loop” designs where the LLM provides suggestions that a physician must verify, rather than autonomous decision-making.
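One way to read the human-in-the-loop recommendation is as a hard gate in code: a model suggestion cannot reach the chart until a physician has approved it. The sketch below is a minimal illustration under that assumption; `Suggestion`, `review`, `commit_to_chart`, and `ReviewRequired` are hypothetical names, not part of PhysAssistBench or any EHR API.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


class ReviewRequired(Exception):
    """Raised when an unapproved suggestion is about to be committed."""


@dataclass(frozen=True)
class Suggestion:
    """A model-generated suggestion awaiting physician review (hypothetical type)."""
    text: str
    status: Status = Status.PENDING
    reviewer: Optional[str] = None


def review(suggestion: Suggestion, reviewer: str, approve: bool) -> Suggestion:
    """Return a copy of the suggestion carrying the physician's decision."""
    status = Status.APPROVED if approve else Status.REJECTED
    return replace(suggestion, status=status, reviewer=reviewer)


def commit_to_chart(suggestion: Suggestion, chart: List[str]) -> None:
    """Append to the chart only if a physician approved the suggestion."""
    if suggestion.status is not Status.APPROVED:
        raise ReviewRequired(f"suggestion is {suggestion.status.value}, not approved")
    chart.append(suggestion.text)


chart: List[str] = []
draft = Suggestion("Possible interaction between drug A and drug B; consider dose review.")

try:
    commit_to_chart(draft, chart)  # blocked: no physician has reviewed it yet
except ReviewRequired as err:
    print("blocked:", err)

commit_to_chart(review(draft, reviewer="dr_example", approve=True), chart)
print(len(chart))  # 1
```

Making the suggestion immutable and routing every write through `commit_to_chart` keeps the approval check in one place, so the model's output stays advisory by construction.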

Finally, domain-specific benchmarks are essential. Generic benchmarks like MMLU or even medical QA datasets are insufficient. PhysAssistBench demonstrates that task integration is the true bottleneck. Developers should create or adopt similar benchmarks for their target clinical workflows.

Key Takeaways

PhysAssistBench evaluates LLMs on the full physician assistance workflow—patient interaction, EHR navigation, and clinical reasoning—revealing significant performance gaps.
Current top models fail at multi-step integration, even when they excel at isolated tasks like medical QA or data retrieval.
AI practitioners must prioritize holistic, workflow-based evaluations over siloed benchmarks to ensure real-world safety.
Deployment-ready medical LLMs will require architectural improvements in context management and a strict human-in-the-loop design.

Read Original Article on Arxiv CS.AI
