Research 2026-06-26

AI Healthcare Chatbots as Information Infrastructure: A Large-Scale Study of User-Reported Breakdowns

arXiv:2606.27302v1. Abstract: AI healthcare chatbots are increasingly used to support health information seeking and self-management, yet their performance and impact on users remains to be studied. This study examines over 15,000 user reviews from 59 AI healthcare chatbot apps...

The Infrastructure Blind Spot: What 15,000 User Reviews Reveal About Healthcare Chatbots

A new large-scale study on arXiv has analyzed over 15,000 user reviews from 59 AI healthcare chatbot applications, cataloging patterns of user-reported breakdowns. Rather than focusing on clinical accuracy alone, the research frames these chatbots as "information infrastructure": systems that users depend on as ongoing, embedded tools for health self-management rather than for occasional queries. The findings are sobering: breakdowns are frequent, often subtle, and rarely about simple factual errors.

The study categorizes failures into recurring types: misinterpretation of symptoms, inappropriate escalation or triage, inability to handle context or history, and responses that users perceived as dismissive or alarmist. Notably, many breakdowns occurred not when the chatbot was wrong, but when it failed to align with user expectations about tone, urgency, or the kind of information being sought. A user asking "Should I go to the ER?" expects a different interaction than one asking "What are the side effects of this medication?"—and the chatbots studied often blurred these lines.

Why This Matters

This research shifts the conversation from "does the chatbot give correct answers?" to "does the chatbot function reliably as part of a user's health management routine?" That distinction is critical. A chatbot that provides accurate medical facts but fails to recognize when a user is describing a genuine emergency is not just unhelpful—it is dangerous. Conversely, a chatbot that over-escalates every minor symptom erodes trust and burdens healthcare systems.

The scale of the dataset is significant. More than 15,000 reviews across 59 apps provide a cross-sectional view that goes beyond controlled lab studies. These are real users in real distress, documenting their frustrations in app stores. The patterns are consistent: users treat these chatbots as infrastructure, not novelties. They expect them to remember past conversations, adapt to changing symptoms, and know when to defer to human professionals.

Implications for AI Practitioners

For developers and product teams, the study offers a clear mandate. First, evaluation metrics for healthcare chatbots must extend beyond accuracy. Practitioners should measure "interactional fit"—how well the chatbot matches user intent, urgency, and emotional state. Second, the concept of "breakdown" needs operationalizing. Not all errors are equal; a chatbot that misidentifies a rash is different from one that fails to recognize suicidal ideation. Teams should build taxonomies of failure modes specific to health contexts.
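As a rough illustration of what such a taxonomy could look like in practice, the sketch below tags reviews with the breakdown types described earlier in this article and computes a severity-weighted summary. It is not the study's coding scheme: the category names, the severity weights, and the LabeledReview and summarize helpers are all hypothetical, chosen only to show that different failure modes can be counted separately and weighted differently.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Breakdown(Enum):
    # Categories paraphrased from the article; the names are illustrative.
    SYMPTOM_MISINTERPRETATION = "symptom_misinterpretation"
    INAPPROPRIATE_TRIAGE = "inappropriate_triage"
    LOST_CONTEXT = "lost_context"
    TONE_MISMATCH = "tone_mismatch"  # perceived as dismissive or alarmist


# Hypothetical severity weights (1 = annoyance, 5 = potential safety risk).
# A real team would set these with clinical input.
SEVERITY = {
    Breakdown.SYMPTOM_MISINTERPRETATION: 4,
    Breakdown.INAPPROPRIATE_TRIAGE: 5,
    Breakdown.LOST_CONTEXT: 2,
    Breakdown.TONE_MISMATCH: 2,
}


@dataclass(frozen=True)
class LabeledReview:
    review_id: str
    breakdowns: tuple[Breakdown, ...]  # empty tuple = no breakdown reported


def summarize(reviews: list[LabeledReview]) -> dict:
    """Return per-category counts, the share of reviews reporting any
    breakdown, and a severity-weighted total."""
    counts = Counter(b for r in reviews for b in r.breakdowns)
    affected = sum(1 for r in reviews if r.breakdowns)
    weighted = sum(SEVERITY[b] * n for b, n in counts.items())
    return {
        "breakdown_rate": affected / len(reviews) if reviews else 0.0,
        "counts": {b.value: n for b, n in counts.items()},
        "severity_weighted_total": weighted,
    }
```

Keeping the weights in a separate table means the severity judgment can be revised without relabeling the reviews.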

Third, transparency is not optional. Users in the study reported frustration when chatbots could not explain their reasoning or limitations. Practitioners should implement clear disclaimers, confidence indicators, and explicit handoff protocols to human clinicians. Finally, the infrastructure framing suggests that chatbots should be designed with statefulness—remembering context across sessions—and with guardrails that prevent cascading errors when users rely on them repeatedly.
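One way to picture an explicit handoff protocol combined with session state is a small routing function like the sketch below. Everything in it is an assumption made for illustration: the Action and Session types, the route function, and the thresholds are hypothetical, and the urgent flag and confidence score are assumed to come from upstream components that this article does not describe.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ANSWER_WITH_CAVEAT = "answer_with_caveat"  # reply plus a stated limitation
    HANDOFF_TO_CLINICIAN = "handoff_to_clinician"


@dataclass
class Session:
    """Hypothetical per-user state kept across turns."""
    user_id: str
    history: list[str] = field(default_factory=list)
    low_confidence_streak: int = 0


def route(session: Session, message: str, urgent: bool, confidence: float,
          min_confidence: float = 0.7, max_streak: int = 2) -> Action:
    """Decide how to respond to one message.

    `urgent` and `confidence` are assumed outputs of upstream classifiers;
    the default thresholds are placeholders, not recommended values.
    """
    session.history.append(message)

    # Urgent messages always go to a human, regardless of confidence.
    if urgent:
        return Action.HANDOFF_TO_CLINICIAN

    if confidence < min_confidence:
        session.low_confidence_streak += 1
        # Guardrail: repeated uncertain answers in one session trigger a
        # handoff instead of letting errors accumulate.
        if session.low_confidence_streak >= max_streak:
            return Action.HANDOFF_TO_CLINICIAN
        return Action.ANSWER_WITH_CAVEAT

    session.low_confidence_streak = 0
    return Action.ANSWER
```

For example, a session that receives a confident reply and then two uncertain ones in a row would see an answer, an answer with a caveat, and finally a handoff. The streak counter is the stateful part: the decision depends on what happened earlier in the session, not only on the current message.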

Key Takeaways

Over 15,000 user reviews reveal that healthcare chatbot failures are often about misaligned expectations, not just factual inaccuracy.
Users treat these apps as ongoing health infrastructure, expecting context awareness, appropriate triage, and emotional sensitivity.
Practitioners must evaluate chatbots on interactional fit and build explicit failure taxonomies for health-specific breakdowns.
Statefulness, transparent reasoning, and clear handoff protocols are essential for safe deployment at scale.

