Research | 2026-06-19

ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

arXiv:2606.19965v1 (announce type: cross). Abstract: Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action...

The Perception-to-Action Gap: Why Seeing Isn't Doing for Multimodal AI

A new benchmark paper, "ROSE," posted to arXiv, tackles a subtle but critical failure mode in multimodal large language models (MLLMs): failing to translate visual perception consistently into context-appropriate action. Many existing benchmarks test whether a model can describe an image or answer factual questions about it. ROSE probes a deeper capability: whether a model can choose the right action from the same visual input when the task context changes.

The core insight is deceptively simple. Consider a kitchen scene with a knife on a counter. In one context, the correct action might be "pick up the knife to chop vegetables." In another, it might be "move the knife away from the child." The visual evidence is identical; only the situational framing differs. ROSE systematically evaluates whether MLLMs can dynamically adjust their outputs to these shifting contexts, revealing a significant gap between perception (seeing the knife) and action (choosing the correct response).

Why This Matters Beyond Academic Benchmarks

This research points to a limitation in how current MLLMs use what they see. Standard training largely teaches models to map visual inputs to textual outputs such as captions and answers, so models tend to be strong at object recognition and scene description but weaker at the pragmatic reasoning that action selection requires. The "perception-to-action gap" is not about missing visual details; it is about failing to integrate those details with task-specific goals.

For AI practitioners deploying multimodal systems in real-world applications, this has immediate implications. A customer service bot analyzing a user's screenshot might correctly identify a product but fail to determine from the conversation history whether the user wants a refund, a replacement, or troubleshooting advice. A robotic system navigating a warehouse might see a box but not know whether to pick it up, move around it, or alert a human, a choice that depends on whether it is restocking or clearing a hazard. The ROSE benchmark suggests that current models lack the contextual reasoning to make these distinctions reliably.

Implications for AI Practitioners

First, context injection alone is not enough. Appending task instructions to a prompt does not guarantee that the model uses them: it may still default to the response it most strongly associates with the visual input and ignore the context. Practitioners need to test for this failure mode explicitly, especially in safety-critical applications.
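One way to test for this failure mode is to hold the image fixed, vary only the task context, and check whether the model's chosen action follows the context. The sketch below is a minimal illustration, not the ROSE protocol: `ContextCase`, `context_sensitivity_report`, the action labels, and the stand-in `context_blind_model` are all hypothetical names introduced here, and a real harness would replace the stand-in with a call to the model under test.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ContextCase:
    """One visual input paired with one task context and its expected action."""
    image_id: str
    context: str
    expected_action: str


def context_sensitivity_report(
    cases: List[ContextCase],
    choose_action: Callable[[str, str], str],
) -> Dict[str, dict]:
    """Group cases by image and check whether predictions track the context."""
    by_image: Dict[str, List[ContextCase]] = {}
    for case in cases:
        by_image.setdefault(case.image_id, []).append(case)

    report: Dict[str, dict] = {}
    for image_id, group in by_image.items():
        predictions = [choose_action(c.image_id, c.context) for c in group]
        correct = [p == c.expected_action for p, c in zip(predictions, group)]
        expected_distinct = len({c.expected_action for c in group})
        report[image_id] = {
            "accuracy": sum(correct) / len(group),
            "all_contexts_correct": all(correct),
            # True when the contexts call for different actions but the
            # model returned the same action for every one of them.
            "collapsed_to_default": expected_distinct > 1
            and len(set(predictions)) == 1,
        }
    return report


def context_blind_model(image_id: str, context: str) -> str:
    """Hypothetical stand-in for a model that ignores the task context."""
    return "pick_up_knife"


# Illustrative cases based on the kitchen example; labels are assumptions.
CASES = [
    ContextCase("kitchen_knife", "You are preparing vegetables.", "pick_up_knife"),
    ContextCase("kitchen_knife", "A child is reaching for the counter.", "move_knife_away"),
]

REPORT = context_sensitivity_report(CASES, context_blind_model)
```

Reporting `collapsed_to_default` separately from accuracy is a deliberate choice in this sketch: a model that is right half the time because it always gives the same answer fails differently from one that varies its answer but picks wrongly.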

Second, fine-tuning strategies must evolve. Standard visual instruction tuning focuses on aligning images with captions or QA pairs. ROSE indicates that models need training data that explicitly varies task contexts while holding visual inputs constant, forcing the model to learn that the same image demands different actions.
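A rough sketch of what such data could look like: each scene is expanded into several training records that share one image but differ in instruction and target action. The record fields (`image`, `instruction`, `target`), the file path, and the context-to-action mapping are assumptions made for illustration; they are not taken from the paper and would need to match whatever format a given fine-tuning pipeline expects.

```python
import json
from typing import Dict, Iterator, List


def expand_scene(image_path: str, context_to_action: Dict[str, str]) -> Iterator[dict]:
    """Yield one training record per task context, all sharing the same image."""
    # Require at least two distinct targets so the image alone cannot
    # predict the answer; the context has to carry the signal.
    if len(set(context_to_action.values())) < 2:
        raise ValueError("need at least two distinct target actions per image")
    for context, action in context_to_action.items():
        yield {"image": image_path, "instruction": context, "target": action}


def to_jsonl(records: List[dict]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


# Hypothetical path and labels, loosely following the warehouse example.
RECORDS = list(
    expand_scene(
        "scenes/warehouse_box.jpg",
        {
            "You are restocking shelves.": "pick_up_box",
            "You are clearing a hazard from the aisle.": "alert_human",
        },
    )
)

JSONL = to_jsonl(RECORDS)
```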

Third, evaluation pipelines need updating. Benchmarks like VQA and MMMU measure knowledge and perception, but not action selection under contextual ambiguity. Teams building multimodal agents should incorporate ROSE-style tests to gauge whether their models can truly act on what they see, rather than merely describe it.
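To make the distinction between describing and acting measurable, an evaluation can score perception and action separately for each item. The following is a minimal sketch under the assumption that every test item has already been graded on both counts. The `EvalResult` type and the summary fields are hypothetical, and the gap computed here is a simple illustrative difference, not the metric defined in the ROSE paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EvalResult:
    """Per-item grades: did the model perceive correctly, and did it act correctly?"""
    perceived_correctly: bool
    acted_correctly: bool


def perception_action_summary(results: List[EvalResult]) -> Dict[str, float]:
    """Summarize perception accuracy, action accuracy, and their difference."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    perceived = [r for r in results if r.perceived_correctly]
    perception_acc = len(perceived) / n
    action_acc = sum(r.acted_correctly for r in results) / n
    # Action accuracy restricted to items the model perceived correctly;
    # undefined (NaN) if it perceived nothing correctly.
    action_given_perception = (
        sum(r.acted_correctly for r in perceived) / len(perceived)
        if perceived
        else float("nan")
    )
    return {
        "perception_accuracy": perception_acc,
        "action_accuracy": action_acc,
        "action_given_perception": action_given_perception,
        "gap": perception_acc - action_acc,
    }


# Made-up grades for illustration only; not results from any real model.
SUMMARY = perception_action_summary(
    [
        EvalResult(True, True),
        EvalResult(True, False),
        EvalResult(True, False),
        EvalResult(False, False),
    ]
)
```

The conditional figure, `action_given_perception`, is the more telling one in this sketch: it isolates cases where the model saw the scene correctly and still chose the wrong action.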

Key Takeaways

The ROSE benchmark reveals that MLLMs frequently fail to select context-appropriate actions even when they correctly perceive visual information, exposing a systematic "perception-to-action gap."
This gap poses practical risks for deployment in dynamic environments where the same visual scene requires different responses based on shifting task goals.
Current training and evaluation methods overemphasize static perception and underemphasize pragmatic action selection, so new data strategies and testing frameworks are needed.
AI practitioners should proactively test for this failure mode and consider fine-tuning approaches that vary task context while holding visual input constant.

