Research 2026-06-24

NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation

arXiv:2606.24450v1 | Announce Type: cross | Abstract: Perceiving physical contact is fundamental to dexterous manipulation. While robots often rely on dedicated hardware tactile sensors, humans exhibit a remarkable ability to infer contact by integrating visual information with an innate sense of their...

What Happened

Researchers have introduced a novel framework called "NoContactNoWorries" that enables dexterous robotic hands to estimate physical contact during manipulation tasks using only vision and proprioception—without dedicated tactile sensors. The system integrates visual observations of the hand-object interaction with internal joint position and force data to infer where and when contact occurs. This approach mimics the human ability to sense contact through sight and bodily awareness, bypassing the need for expensive, fragile, or hard-to-integrate tactile hardware.

The work, posted as a preprint on arXiv, reports that contact can be estimated by fusing these two readily available sensing modalities. The framework likely employs a learned model that correlates visual cues (e.g., finger proximity to the object, visible deformation or shadowing at the contact site) with proprioceptive signals (joint angles, torques) to predict contact points and forces.
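To make the idea of fusing a visual cue with a proprioceptive cue concrete, here is a minimal sketch in Python. It is not the paper's model: the feature names (`visual_gap_mm`, `torque_residual_nm`), the logistic form, and the hand-picked weights are all assumptions made for illustration, standing in for whatever learned model the authors actually use.

```python
import math
from dataclasses import dataclass


@dataclass
class FingerObservation:
    # Hypothetical per-finger features; neither name comes from the paper.
    visual_gap_mm: float       # fingertip-to-object gap estimated from the camera
    torque_residual_nm: float  # measured joint torque minus free-motion prediction


def contact_probability(obs: FingerObservation,
                        w_gap: float = -1.5,
                        w_torque: float = 40.0,
                        bias: float = 0.0) -> float:
    """Fuse one visual and one proprioceptive cue with a logistic model.

    The weights are hand-picked placeholders standing in for learned parameters.
    """
    logit = w_gap * obs.visual_gap_mm + w_torque * obs.torque_residual_nm + bias
    logit = max(-60.0, min(60.0, logit))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-logit))


# A fingertip that looks close and feels resistance vs. one hovering in free space.
touching = FingerObservation(visual_gap_mm=0.2, torque_residual_nm=0.05)
hovering = FingerObservation(visual_gap_mm=5.0, torque_residual_nm=0.0)
print(contact_probability(touching), contact_probability(hovering))
```

With these placeholder weights, a small visual gap combined with an unexplained joint torque pushes the probability toward 1, while a large gap with no torque residual pushes it toward 0. A real system would learn the mapping from data and use much richer visual features.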

Why It Matters

This research addresses a critical bottleneck in dexterous manipulation: the reliance on tactile sensing. Hardware tactile sensors, while powerful, introduce complexity in calibration, durability, and integration—especially for small or irregularly shaped objects. By showing that contact can be estimated without them, the work lowers the barrier to entry for robust in-hand manipulation.

For AI practitioners, the practical implication is that no new sensors may be required. Many robotic systems already have cameras and joint encoders, and this framework suggests that such hardware can be repurposed for contact-aware manipulation. It also aligns with trends in "sim-to-real" transfer: a model can be trained in simulation, where ground-truth contact can be read directly from the physics engine, and then deployed on real systems that have only vision and proprioception as inputs.

The human analogy is instructive: we can often tell we are holding a cup from seeing it in our hand and feeling its resistance in our muscles and joints, not only from the skin of our fingertips. This work formalizes that intuition into a computational pipeline, potentially making dexterous manipulation more scalable and robust.

Implications for AI Practitioners

Sensor fusion over specialization: The success of this approach reinforces the value of combining multiple imperfect signals (vision + proprioception) to approximate a missing modality (touch). Practitioners should consider whether their own systems can achieve similar results by fusing existing sensors rather than adding new ones.

Training data generation: The framework likely requires paired data of visual-proprioceptive inputs with ground-truth contact labels. This can be generated in simulation or through clever real-world setups (e.g., instrumented objects). Practitioners should plan for this data collection pipeline, which may be non-trivial.
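As a rough sketch of how simulation could supply such labels, the snippet below thresholds a simulated per-fingertip force to get binary contact labels and pairs them with the inputs a model would see. The record layout (`SimFrame`), the field names, and the 0.05 N threshold are hypothetical; the paper's actual labelling procedure is not described in the summary above.

```python
from typing import NamedTuple, Tuple


class SimFrame(NamedTuple):
    # Hypothetical record layout for one simulated timestep.
    image_id: str
    joint_angles_rad: Tuple[float, ...]
    fingertip_force_n: Tuple[float, ...]  # per-fingertip normal force from the simulator


def label_frame(frame: SimFrame, force_threshold_n: float = 0.05) -> dict:
    """Pair the model inputs with binary per-fingertip contact labels."""
    contact = tuple(f > force_threshold_n for f in frame.fingertip_force_n)
    return {
        "image_id": frame.image_id,
        "joint_angles_rad": frame.joint_angles_rad,
        "contact": contact,
    }


frame = SimFrame("frame_000123", (0.10, 0.42, 0.77), (0.0, 0.31, 0.02))
sample = label_frame(frame)
print(sample["contact"])
```

The threshold matters: set it too low and simulator noise becomes spurious contact, set it too high and light touches are labelled as no contact. It is worth tuning against a small set of real, instrumented measurements if those are available.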

Real-time constraints: Contact estimation for manipulation must be fast, ideally under 10 ms per estimate. Proprioception is low-dimensional and cheap to process, so the vision branch (often a CNN) usually dominates the cost and is where optimization effort should go. Edge deployment or model compression may be necessary for practical use.
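Before optimizing anything, it helps to measure where an estimator stands against its latency budget. The harness below is a generic sketch using only the Python standard library; `measure_latency_ms`, `within_budget`, and the stand-in workload are illustrative names, and the 10 ms default simply mirrors the budget mentioned above.

```python
import time
from typing import Callable, Dict


def measure_latency_ms(fn: Callable[[], object],
                       n_runs: int = 200,
                       warmup: int = 20) -> Dict[str, float]:
    """Time repeated calls to fn and report median, 95th percentile, and max."""
    for _ in range(warmup):  # warm caches before timing
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": samples[len(samples) // 2],
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def within_budget(stats: Dict[str, float], budget_ms: float = 10.0) -> bool:
    """Judge on tail latency, since a control loop feels the slow calls."""
    return stats["p95_ms"] <= budget_ms


# Stand-in workload; replace with a call to the real contact estimator.
stats = measure_latency_ms(lambda: sum(i * i for i in range(1000)))
print(stats, within_budget(stats))
```

Checking the 95th percentile rather than the mean is a deliberate choice here: occasional slow frames are what cause a controller to act on stale contact estimates.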

Failure modes: Without tactile feedback, the system may struggle in visually occluded scenarios (e.g., fingers fully covering the object) or when proprioceptive noise is high. Practitioners should characterize these edge cases and consider fallback strategies.
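One possible fallback strategy for the occlusion and noise cases above is to gate each modality on a reliability signal and to return no estimate when neither can be trusted. The sketch below assumes that per-modality contact probabilities, a visible-fraction score, and a proprioceptive noise score are already available. These inputs, the thresholds, and the equal-weight average are all hypothetical and not taken from the paper.

```python
from typing import Optional, Tuple


def fuse_with_fallback(p_vision: float,
                       p_proprio: float,
                       visible_fraction: float,
                       proprio_noise: float,
                       min_visible: float = 0.3,
                       max_noise: float = 0.5) -> Tuple[Optional[float], str]:
    """Return (contact probability or None, name of the mode that produced it)."""
    vision_ok = visible_fraction >= min_visible
    proprio_ok = proprio_noise <= max_noise
    if vision_ok and proprio_ok:
        # Both modalities trusted: simple equal-weight average.
        return 0.5 * (p_vision + p_proprio), "fused"
    if proprio_ok:
        # Contact region occluded: rely on proprioception alone.
        return p_proprio, "proprio_only"
    if vision_ok:
        # Proprioception too noisy: rely on vision alone.
        return p_vision, "vision_only"
    # Neither is reliable: let the caller choose a conservative action.
    return None, "no_estimate"


print(fuse_with_fallback(0.8, 0.6, visible_fraction=0.1, proprio_noise=0.2))
```

Returning the mode name alongside the estimate lets the controller slow down or loosen its grasp policy when it is running on a single modality, and log how often each fallback path is taken.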

Key Takeaways

NoContactNoWorries demonstrates that contact estimation for dexterous manipulation is feasible using only vision and proprioception, eliminating the need for dedicated tactile sensors.
This reduces hardware complexity and cost, making contact-aware manipulation more accessible for real-world robotic systems.
AI practitioners should explore sensor fusion strategies to approximate missing modalities, but must account for data collection, real-time performance, and occlusion-related failure modes.
The work highlights a promising direction: leveraging human-inspired multimodal perception to simplify robotic hardware without sacrificing manipulation capability.
