BeClaude
Research, 2026-07-01

RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization

Originally published by arXiv cs.AI

arXiv:2606.31694v1. Announce type: cross. Abstract: For robots manipulating open-world objects, tactile representations must generalize to unseen materials. We introduce RCT (Robotic Contact Tactile), a robot-collected touch-vision-language dataset with 29,279 tactile frames from full robot presses...

What Happened

Researchers have released RCT (Robotic Contact Tactile), a multimodal dataset containing 29,279 tactile frames captured during full robot presses on various objects. Unlike prior tactile datasets that rely on static sensor readings or simulated data, RCT is collected through actual robotic manipulation, pairing each tactile frame with corresponding vision and language annotations. The dataset is designed to help robots generalize tactile understanding to materials they have never encountered before, a critical gap in current manipulation systems.

Why It Matters

Tactile sensing remains one of the least developed modalities in embodied AI. Vision and language models have advanced rapidly, but robots still struggle to infer surface properties like hardness, friction, or texture from visual cues alone. This limitation becomes acute in open-world settings where a robot might encounter a novel object: for example, a silicone spatula when it has been trained only on rubber balls and wooden blocks.

RCT addresses this by providing a large-scale, robot-collected corpus that aligns tactile feedback with visual observations and natural language descriptions. The key innovation is the collection methodology: full robot presses ensure consistent, repeatable contact events across diverse materials, generating data that captures the dynamic relationship between applied force and surface deformation. This is fundamentally different from static touch datasets that miss the temporal and force-dependent aspects of real manipulation.

For AI practitioners, RCT offers a pathway toward more robust tactile generalization. Current tactile models often overfit to specific sensor types or material categories; a dataset built on actual robotic interaction can help models learn invariant features that transfer across unseen materials. The inclusion of language annotations also opens the door to multimodal reasoning: a robot could, in theory, understand a verbal instruction like "grip this gently, it's fragile" by correlating language with tactile signatures learned from RCT.
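To measure generalization to unseen materials, a common protocol is to hold out entire materials rather than individual frames, so that no material seen in training appears at test time. The sketch below illustrates that idea in plain Python. It is not the paper's evaluation protocol: the record fields `frame_id` and `material`, the function `split_by_material`, and the toy material names are all hypothetical and do not reflect the actual RCT schema.

```python
import random
from collections import defaultdict


def split_by_material(records, held_out_fraction=0.2, seed=0):
    """Split records so that no material appears in both train and test.

    `records` is a list of dicts with a "material" key (a hypothetical
    field name used for illustration only).
    """
    # Group frames by the material they were collected on.
    by_material = defaultdict(list)
    for record in records:
        by_material[record["material"]].append(record)

    # Shuffle the material names (not the frames) deterministically.
    materials = sorted(by_material)
    rng = random.Random(seed)
    rng.shuffle(materials)

    # Hold out at least one whole material for testing.
    n_test = max(1, round(len(materials) * held_out_fraction))
    test_materials = set(materials[:n_test])

    train = [r for m in materials if m not in test_materials for r in by_material[m]]
    test = [r for m in materials if m in test_materials for r in by_material[m]]
    return train, test, test_materials


# Toy records with made-up materials; four frames per material.
records = [
    {"frame_id": i, "material": material}
    for i, material in enumerate(["rubber", "wood", "silicone", "steel", "foam"] * 4)
]
train, test, held_out = split_by_material(records, held_out_fraction=0.2, seed=0)
print(f"held-out materials: {sorted(held_out)}")
print(f"train frames: {len(train)}, test frames: {len(test)}")
```

Splitting at the material level is what separates a test of generalization from a test of memorization: a random frame-level split would leak near-duplicate presses of the same material into both sets.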

Implications for AI Practitioners

  • Multimodal alignment challenges: RCT provides a testbed for aligning tactile, visual, and linguistic representations. Practitioners working on fusion architectures (e.g., cross-attention transformers) will find this dataset valuable for benchmarking how well models can integrate touch into existing vision-language pipelines.
  • Sim-to-real transfer: Because RCT is robot-collected, it avoids the domain gap that plagues simulated tactile data. Researchers developing sim-to-real tactile models can use RCT as a real-world validation set or as a source of priors for fine-tuning.
  • Material property prediction: The dataset enables supervised learning tasks such as predicting material hardness, surface roughness, or friction coefficient from tactile sequences. Practitioners building robotic grasping systems can use these predictions to adjust grip force or grasp strategy in real time.
  • Data efficiency and scaling: At ~29,000 frames, RCT is modest by vision dataset standards but large for tactile data. Practitioners should consider whether this scale is sufficient for training foundation models, or whether it serves better as a benchmark for evaluating few-shot or zero-shot tactile generalization methods.
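As one illustration of the multimodal alignment point above, a widely used way to align two modalities is a symmetric contrastive (InfoNCE-style) loss over paired embeddings. The sketch below is a minimal pure-Python version under stated assumptions: the summary does not say that the RCT authors use this objective, and the function names, the temperature of 0.07, and the toy 2-D embeddings are all hypothetical. It also assumes every embedding has non-zero norm.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (assumes a non-zero norm)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(touch_emb, text_emb, temperature=0.07):
    """Average of touch-to-text and text-to-touch contrastive losses.

    Row i of `touch_emb` is assumed to be paired with row i of `text_emb`.
    """
    touch = [l2_normalize(v) for v in touch_emb]
    text = [l2_normalize(v) for v in text_emb]

    # Cosine similarity between every touch/text pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(t, s)) / temperature for s in text]
        for t in touch
    ]
    logits_transposed = [list(column) for column in zip(*logits)]

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_transposed))


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the pairing swapped.
touch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(touch, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(touch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

In practice the embeddings would come from learned tactile, vision, and language encoders, and the same loss can be applied to each pair of modalities; the toy vectors here only show the direction in which the loss moves.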

Key Takeaways

  • RCT is a robot-collected tactile dataset pairing 29,279 tactile frames from full robot presses with vision and language annotations, designed to improve tactile generalization to unseen materials.
  • The dataset addresses a critical gap in embodied AI: enabling robots to infer surface properties from touch when visual cues are insufficient.
  • AI practitioners can use RCT for multimodal alignment research, material property prediction, and as a real-world benchmark for sim-to-real tactile transfer.
  • At its current scale, RCT is likely better suited to fine-tuning or evaluation than to training large foundation models from scratch.