Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue
arXiv:2606.31719v1 (announce type: cross). Abstract: In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs) can distinguish what could be shared from what...
What Happened
A new preprint on arXiv (2606.31719v1) investigates a subtle but consequential failure mode in vision-language models (VLMs): difficulty assessing common ground in asymmetric dialogue. According to the paper, some VLMs can process shared visual input yet overestimate how far their conversational partner shares the same interpretation of that input. In controlled experiments with asymmetric setups, where one participant has access to visual information the other does not, these models failed to distinguish between "what is visible to both" and "what is mutually understood." This mismatch between perceptual overlap and interpretive alignment has direct consequences for any system deployed in collaborative tasks.
Why It Matters
This research challenges a foundational assumption in human-AI interaction: that shared perception implies shared understanding. In practice, human collaborators constantly negotiate meaning through clarification, repair, and perspective-taking. The VLMs studied, by contrast, appear to collapse these distinct processes into a single perceptual judgment. The implication is that these models lack a robust theory of mind for visual contexts. For applications like remote assistance, collaborative design, or educational tutoring, where one party may see something the other cannot, this blind spot could lead to persistent miscommunication. A VLM that assumes its partner sees what it sees will fail to flag ambiguities, skip clarifications the partner needs, or worse, act on false assumptions about shared knowledge.
Implications for AI Practitioners
Deploy with asymmetric contexts in mind. If your system involves any scenario where users have different visual access (a technician guiding a remote repair, or a doctor reviewing scans with a patient), test explicitly for common-ground failures. Do not assume that a model which can see an image also understands that the user may not see it, or may interpret it differently.
Build explicit grounding mechanisms. Relying on the model's implicit ability to track common ground is insufficient. Practitioners should implement structured dialogue policies that force the model to confirm shared understanding before proceeding with task-critical actions. This could include simple yes/no checks or asking the user to describe what they see.
Evaluate beyond perceptual accuracy. Standard VLM benchmarks focus largely on object recognition and caption quality. This research suggests that evaluation suites should also include theory-of-mind tasks: scenarios where the model must reason about what another agent does or does not know based on asymmetric access.
Consider fine-tuning on grounded dialogue data. Off-the-shelf VLMs are trained primarily on static image-text pairs, not interactive grounding. Fine-tuning on datasets that include clarification, repair, and perspective-taking exchanges may help models learn to distinguish shared perception from shared interpretation.
Key Takeaways
- Some VLMs overestimate common ground in asymmetric visual dialogue, conflating shared perception with shared interpretation.
- This failure undermines collaborative tasks where users have different visual access, leading to persistent miscommunication.
- Practitioners must implement explicit grounding mechanisms and test for theory-of-mind failures, not just perceptual accuracy.
- Fine-tuning on interactive, grounded dialogue data may help models learn to track what others actually know versus what is merely visible.
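The explicit grounding mechanism recommended under Implications for AI Practitioners can be illustrated with a small sketch. It is not taken from the preprint; it is one possible way to keep "visible to both" separate from "mutually confirmed" in application code, so that a task-critical action is blocked until the user has acknowledged every referent it depends on. All names here (GroundingState, next_step, the PROCEED/CLARIFY prefixes) are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GroundingState:
    """Hypothetical tracker: what each side can see vs. what was confirmed."""

    visible_to_model: set[str] = field(default_factory=set)
    visible_to_user: set[str] = field(default_factory=set)
    confirmed: set[str] = field(default_factory=set)

    def perceptual_overlap(self) -> set[str]:
        """Referents both parties can see. Overlap alone is NOT common ground."""
        return self.visible_to_model & self.visible_to_user

    def confirm(self, referent: str) -> None:
        """Record that the user explicitly acknowledged this referent."""
        self.confirmed.add(referent)

    def ungrounded(self, required: set[str]) -> set[str]:
        """Referents an action depends on that the user has not confirmed."""
        return required - self.confirmed


def next_step(state: GroundingState, action: str, required: set[str]) -> str:
    """Return the action only if every required referent is confirmed.

    Otherwise return a clarification move for the first missing referent.
    """
    missing = sorted(state.ungrounded(required))
    if not missing:
        return f"PROCEED: {action}"
    target = missing[0]
    if target not in state.perceptual_overlap():
        # The user cannot see it (or the model cannot), so describe before asking.
        return f"CLARIFY: You may not be able to see '{target}'. Let me describe it."
    # Both can see it, but visibility is not agreement: ask for confirmation.
    return f"CLARIFY: Can you confirm you see '{target}'? Please describe it."


if __name__ == "__main__":
    # Assumed scenario: a remote repair where only the model sees the red valve.
    state = GroundingState(
        visible_to_model={"red valve", "gauge"},
        visible_to_user={"gauge"},
    )
    print(next_step(state, "close the red valve", {"red valve"}))
    state.confirm("red valve")
    print(next_step(state, "close the red valve", {"red valve"}))
```

The design choice is that confirmation is recorded only through an explicit confirm call, never inferred from visibility, which mirrors the distinction the article draws between shared perception and shared interpretation. In a real system the confirm call would be driven by the user's reply to the clarification, and how to judge that reply is left open here.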