Research | 2026-07-02

Text Over Image: Auditing Multimodal Robustness in Synthetic Medical Image Detection

Originally published by arXiv cs.AI

arXiv:2606.25375v2. Abstract: With the rapid adoption of generative AI, synthetic medical images pose growing risks, including diagnostic deception and insurance fraud. Although prior work has explored vision-language model (VLM)-based synthetic image detection, these...

The Blind Spot in Medical Image Forensics

A recent arXiv preprint (arXiv:2606.25375) examines a critical vulnerability in synthetic medical image detection: the gap between what vision-language models (VLMs) see and what they read. Prior work has largely focused on detecting AI-generated medical images through visual artifacts alone; this research introduces a more realistic threat model, one in which text overlaid on an image can fool even sophisticated multimodal detectors.

The core finding is straightforward yet alarming: synthetic medical images with superimposed text (e.g., patient IDs, scan parameters, or diagnostic labels) can bypass current detection systems that rely on unimodal visual analysis. The authors demonstrate that VLMs, which process both image pixels and text tokens, are susceptible to adversarial text overlays that don't degrade the image's visual plausibility but confuse the detection pipeline.

Why This Matters

This isn't an academic curiosity. Medical imaging fraud is a multibillion-dollar problem, and synthetic images are increasingly weaponized for:

Insurance fraud: Fabricated X-rays or MRIs to support false claims
Diagnostic deception: Inserting fake findings into legitimate imaging workflows
Clinical trial manipulation: Creating phantom patient data for drug approvals

The text-over-image vulnerability is particularly dangerous because it exploits a fundamental asymmetry: detection models are trained on clean, text-free images, but real-world medical images almost always contain embedded text. A fraudster could generate a synthetic CT scan, overlay realistic-looking patient metadata, and have it pass detection—not because the image is convincing, but because the detector's visual-only analysis misses the textual context.

Implications for AI Practitioners

1. Multimodal detection is necessary but not sufficient. Simply adding a text encoder to a vision model isn't enough. Practitioners must train on paired adversarial examples where text overlays are deliberately injected into both real and synthetic images. The research suggests that text-aware detection requires joint embedding spaces that explicitly model text-image consistency.

2. Domain-specific robustness testing is critical. Medical imaging pipelines should include stress tests with realistic text overlays (varying fonts, positions, and medical terminology). A model that achieves 99% accuracy on clean images may drop to 70% when text is present.

3. Human-in-the-loop verification remains essential. For high-stakes medical decisions, automated detection should flag suspicious images for human review rather than making binary decisions. The text-overlay attack exploits statistical patterns that VLMs learn; humans are less susceptible to this specific failure mode.

4. Synthetic image detection is an arms race. As detection methods improve, so will adversarial attacks. This research underscores the need for continuous adversarial training and the development of explainable detection systems that can articulate why an image is flagged, not just that it is.
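
The stress test described in point 2 can be sketched as a small audit harness that scores the same detector on clean images and on copies with an overlay applied, then reports the gap. This is a minimal illustration, not the paper's protocol. Every name here (`audit`, `stamp_block`, `brightness_detector`) and the toy image format (a list of pixel rows) is an assumption introduced for the example; a real pipeline would render actual text with an imaging library and call a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Toy stand-ins (assumptions, not from the paper):
Image = List[List[int]]             # grayscale image as rows of 0-255 pixel values
Detector = Callable[[Image], bool]  # True means "judged synthetic"
Overlay = Callable[[Image], Image]  # returns a copy with text burned in
Sample = Tuple[Image, bool]         # (image, ground truth: is it synthetic?)


@dataclass(frozen=True)
class AuditResult:
    clean_accuracy: float
    overlay_accuracy: float

    @property
    def accuracy_drop(self) -> float:
        """Accuracy lost once overlays are applied."""
        return self.clean_accuracy - self.overlay_accuracy


def accuracy(detector: Detector, samples: Sequence[Sample]) -> float:
    if not samples:
        raise ValueError("need at least one labeled sample")
    correct = sum(detector(image) == label for image, label in samples)
    return correct / len(samples)


def audit(detector: Detector, samples: Sequence[Sample],
          overlay_fn: Overlay) -> AuditResult:
    """Score the detector on clean samples and on overlaid copies of them."""
    overlaid = [(overlay_fn(image), label) for image, label in samples]
    return AuditResult(accuracy(detector, samples), accuracy(detector, overlaid))


def stamp_block(image: Image) -> Image:
    """Hypothetical overlay: paint the top row white, standing in for rendered text."""
    stamped = [row[:] for row in image]  # copy so the original is untouched
    stamped[0] = [255] * len(stamped[0])
    return stamped


def brightness_detector(image: Image) -> bool:
    """Toy detector: calls an image synthetic when its mean pixel value is below 100."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels) < 100


samples = [
    ([[50, 50], [50, 50]], True),       # dark toy "synthetic" image
    ([[200, 200], [200, 200]], False),  # bright toy "real" image
]
result = audit(brightness_detector, samples, stamp_block)
print(f"clean={result.clean_accuracy:.2f} "
      f"overlay={result.overlay_accuracy:.2f} "
      f"drop={result.accuracy_drop:.2f}")
# prints: clean=1.00 overlay=0.50 drop=0.50
```

The toy detector is deliberately brittle so that the overlay flips one of its decisions. In practice, the useful output is the `accuracy_drop` figure, computed separately for each overlay variant (font, position, terminology) so that the weakest condition is visible rather than averaged away.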

Key Takeaways

New attack vector: Text overlays on synthetic medical images can bypass current VLM-based detection systems that focus solely on visual artifacts
Real-world relevance: Medical images almost always contain embedded text, making this vulnerability exploitable for fraud and deception
Detection requires joint modeling: Effective defense demands training on text-image paired adversarial examples, not just unimodal visual analysis
Operational caution: Automated detection should be augmented with human review for high-stakes medical applications, as adversarial robustness remains incomplete
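
The human-review recommendation can be sketched as a routing rule rather than a yes/no verdict. The version below is only one possible design and is not taken from the paper: it assumes the detector returns a score in [0, 1] (higher means more likely synthetic) and that the same image can be scored a second time with its overlay text masked out. The function name `triage`, the `Route` values, and both thresholds are hypothetical.

```python
from enum import Enum
from typing import Tuple


class Route(Enum):
    PASS = "pass"
    HUMAN_REVIEW = "human_review"


def triage(score_original: float, score_text_masked: float,
           clear_below: float = 0.2, max_shift: float = 0.15) -> Tuple[Route, str]:
    """Route an image using detector scores with and without its overlay text.

    Nothing is rejected automatically: an image either passes or is
    escalated to a human reviewer, together with the reason.
    """
    for score in (score_original, score_text_masked):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")

    # If masking the text moves the score a lot, the text is steering the
    # detector, so neither score is trusted on its own.
    if abs(score_original - score_text_masked) > max_shift:
        return Route.HUMAN_REVIEW, "score changes when overlay text is masked"

    # Use the more suspicious of the two scores for the clearance check.
    if max(score_original, score_text_masked) >= clear_below:
        return Route.HUMAN_REVIEW, "score too high to clear automatically"

    return Route.PASS, "low and stable score"


for scores in [(0.05, 0.10), (0.05, 0.60), (0.50, 0.55)]:
    route, reason = triage(*scores)
    print(scores, route.value, "-", reason)
```

Returning a reason alongside the route is a small step toward the explainability the article calls for: the reviewer sees why an image was escalated, not only that it was.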

