Research | 2026-06-30

SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

Originally published by arXiv cs.AI

arXiv:2511.17649v4. Abstract: Tangible control interfaces (TCIs), such as appliance panels, remotes, elevators, and embedded GUIs, are a fundamental component of everyday human-built environments. Interacting with these interfaces requires agents not only to ground...

The Benchmark That Forces AI to Actually Touch the World

A new benchmark called SWITCH has been released, targeting a critical blind spot in embodied AI research: tangible control interfaces (TCIs). While most existing benchmarks focus on navigation, object manipulation, or high-level planning, SWITCH specifically evaluates an agent’s ability to interact with the physical interfaces that humans use daily—elevator buttons, appliance panels, remote controls, and embedded touchscreens. The benchmark requires agents to perform long-horizon tasks that involve locating, interpreting, and physically actuating these interfaces in realistic 3D environments.

Why This Matters Beyond Academic Benchmarks

The gap SWITCH addresses is not trivial. Current state-of-the-art embodied agents can navigate a house, but many fail at the simple act of pressing the correct elevator button or operating a microwave panel. TCIs present a distinct challenge: they are visually sparse (a few buttons on a flat surface), functionally ambiguous (what does this unlabeled dial do?), and demand precise spatial reasoning (press the third button from the left, not the second). Their state changes are also often indirect: a button press might trigger a delayed response or start a multi-step sequence.

The practical implications are significant. Any real-world deployment of embodied AI—whether in home robotics, warehouse automation, or assistive technology—will inevitably encounter TCIs. A robot that can navigate a hospital but cannot call an elevator is functionally useless. By explicitly benchmarking this capability, SWITCH forces the research community to move beyond “look and plan” paradigms toward “look, plan, and physically interact with constrained interfaces.”

Implications for AI Practitioners

For those building embodied systems, SWITCH highlights several technical bottlenecks:

Perception under ambiguity. TCIs often lack clear visual affordances. A flat glass panel with a single unlabeled button requires the agent to infer function from context, a skill that current vision-language models handle poorly without explicit training.
Action precision and error recovery. Unlike grasping a cup, pressing a small button requires millimeter-level precision. The benchmark likely reveals that reinforcement learning policies trained on coarse manipulation tasks fail when the action space is constrained and the penalty for mis-pressing is high (e.g., calling the wrong floor).
Long-horizon planning with stateful interfaces. Many TCIs have hidden states: an elevator remembers which floor was pressed, a microwave timer counts down. Agents must maintain internal state representations across multiple steps, which current transformer-based architectures handle inconsistently.
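
To make the hidden-state and error-recovery points concrete, here is a minimal Python sketch of a toy elevator panel and an agent-side tracker. Everything in it (ToyElevatorPanel, BeliefTracker, ride_to, the one-floor-per-tick dynamics) is a hypothetical illustration and not part of SWITCH or its API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyElevatorPanel:
    """Hypothetical stateful interface: latches requested floors as hidden state."""
    current_floor: int = 1
    requested: set = field(default_factory=set)

    def press(self, floor: int) -> None:
        # A press only latches a request; the car does not move yet.
        if floor != self.current_floor:
            self.requested.add(floor)

    def tick(self) -> None:
        # Delayed response: move one floor per tick toward the nearest request.
        if not self.requested:
            return
        target = min(self.requested, key=lambda f: abs(f - self.current_floor))
        self.current_floor += 1 if target > self.current_floor else -1
        self.requested.discard(self.current_floor)


@dataclass
class BeliefTracker:
    """Agent-side memory of its own presses (assumed to be observed reliably)."""
    pressed: list = field(default_factory=list)

    def record_press(self, floor: int) -> None:
        self.pressed.append(floor)

    def needs_recovery(self, goal_floor: int) -> bool:
        # A mis-press is detected when the last latched request is not the goal.
        return bool(self.pressed) and self.pressed[-1] != goal_floor


def ride_to(panel, tracker, goal_floor, first_press, max_ticks=10):
    """Press once (possibly wrongly), recover if needed, then wait for arrival."""
    panel.press(first_press)
    tracker.record_press(first_press)
    if tracker.needs_recovery(goal_floor):
        panel.press(goal_floor)
        tracker.record_press(goal_floor)
    for _ in range(max_ticks):
        if panel.current_floor == goal_floor:
            return True
        panel.tick()
    return panel.current_floor == goal_floor
```

In this toy setup, calling ride_to(ToyElevatorPanel(), BeliefTracker(), goal_floor=3, first_press=4) still reaches floor 3, but the mistaken request for floor 4 stays latched in the panel. That leftover state is the kind of side effect an agent would have to track across steps.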

For practitioners, the immediate takeaway is that off-the-shelf vision-language models and navigation policies are insufficient for real-world interface interaction. Fine-tuning on interface-specific datasets, incorporating tactile feedback simulation, and designing hierarchical policies that separate “which button” from “how to press it” will be necessary.
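
One way to picture the split between "which button" and "how to press it" is the Python sketch below. It is illustrative only: the Button type, the keyword-matching selector, and the standoff and depth values are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    label: str
    x: float  # panel-frame coordinates in metres (assumed convention)
    y: float


def select_button(buttons, instruction):
    """High level ("which button"): first button whose label is a word in the instruction."""
    words = instruction.lower().split()
    for button in buttons:
        if button.label.lower() in words:
            return button
    return None


def plan_press(button, standoff=0.02, depth=0.003):
    """Low level ("how to press it"): approach, press, and retract waypoints.

    z is the distance from the panel surface; a negative z pushes the button in.
    """
    return [
        (button.x, button.y, standoff),
        (button.x, button.y, -depth),
        (button.x, button.y, standoff),
    ]


def act(buttons, instruction):
    """Compose the two levels; returns an empty plan when no button matches."""
    target = select_button(buttons, instruction)
    if target is None:
        return []  # a real system might explore or ask for clarification here
    return plan_press(target)
```

Keeping the two levels separate means the selector could be swapped for a vision-language model and the press planner for a learned controller without changing the interface between them.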

Key Takeaways

SWITCH fills a specific gap in embodied AI evaluation by focusing on tangible control interfaces, which are ubiquitous in human environments but underrepresented in existing benchmarks.
The benchmark exposes weaknesses in current agents’ ability to handle visually sparse, functionally ambiguous interfaces that require precise physical interaction.
Practitioners should expect that general-purpose embodied policies will underperform on TCI tasks, necessitating specialized perception and control modules.
Long-horizon tasks involving stateful interfaces (e.g., elevators, appliances) will require better internal state tracking and error recovery mechanisms than current architectures provide.
