RoboSSM: Scalable In-context Imitation Learning via State-Space Models
arXiv:2509.19658v2. Abstract: In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to...
What Happened
A new research paper, "RoboSSM: Scalable In-context Imitation Learning via State-Space Models," proposes replacing the Transformer, the dominant backbone for in-context imitation learning (ICIL), with State-Space Models (SSMs). The core argument is that SSMs such as Mamba process long demonstration sequences more efficiently than Transformers, whose self-attention cost grows quadratically with sequence length. The authors report that RoboSSM achieves competitive or superior performance on standard robot manipulation benchmarks while scaling better in both sequence length and computational cost.
Why It Matters
This work addresses a critical bottleneck in current robotic learning pipelines. In-context imitation learning allows a robot to watch a few demonstrations and immediately replicate the behavior without retraining, a powerful capability for few-shot adaptation. However, as the number of demonstrations or the length of each trajectory grows, Transformer-based models become prohibitively slow and memory-intensive. Self-attention compares every token in the prompt with every other token, so its cost grows quadratically with prompt length, which limits how many examples a robot can "see" in a single prompt.
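To make the quadratic growth concrete, here is a back-of-the-envelope sketch in Python. It assumes one token per timestep and counts the entries of a single attention score matrix (one head, one layer). The function name, the demonstration counts, and the 100-step trajectory length are illustrative assumptions, not figures from the paper.

```python
def attention_score_entries(num_demos: int, steps_per_demo: int) -> int:
    """Entries in one T x T self-attention score matrix (one head, one layer).

    Assumes one token per timestep, which is a simplification.
    """
    context_len = num_demos * steps_per_demo
    return context_len * context_len


if __name__ == "__main__":
    # Illustrative values only: 100 steps per demonstration.
    for demos in (1, 10, 100):
        entries = attention_score_entries(demos, steps_per_demo=100)
        print(f"{demos:>3} demos -> {entries:,} score entries")
```

Under these assumptions, going from 1 to 100 demonstrations multiplies the context length by 100 and the score matrix by 10,000.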
By using SSMs, whose compute grows linearly with sequence length and whose recurrent state has a fixed size, RoboSSM opens the door to scaling ICIL to much larger context windows. This means a robot could potentially ingest dozens or hundreds of demonstrations in a single prompt, capturing more nuanced behaviors and edge cases without needing to fine-tune. For practitioners, this is a practical step toward deployment-ready systems that can adapt on the fly in dynamic environments, such as warehouse picking or surgical assistance, where retraining is infeasible.
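The linear-time property can be illustrated with a toy recurrence. The sketch below is a minimal time-invariant diagonal linear SSM in NumPy, not the model used in RoboSSM and not Mamba (which adds input-dependent parameters and fused GPU kernels). All names, dimensions, and the random "demonstration" data are assumptions made for illustration. It shows that each step touches only a fixed-size state, so a long prompt can be streamed in chunks while carrying nothing but that state forward.

```python
import numpy as np


def ssm_scan(inputs, decay, in_proj, out_proj, state=None):
    """Run a toy diagonal linear SSM over a sequence.

    h_t = decay * h_{t-1} + in_proj @ x_t
    y_t = out_proj @ h_t

    inputs:   (T, d_in) array
    decay:    (d_state,) per-channel decay, |decay| < 1
    in_proj:  (d_state, d_in)
    out_proj: (d_out, d_state)
    Returns outputs of shape (T, d_out) and the final state (d_state,).
    """
    if state is None:
        state = np.zeros(decay.shape[0])
    outputs = np.empty((inputs.shape[0], out_proj.shape[0]))
    for t, x_t in enumerate(inputs):
        state = decay * state + in_proj @ x_t  # fixed-size state update
        outputs[t] = out_proj @ state
    return outputs, state


rng = np.random.default_rng(0)
d_in, d_state, d_out = 4, 8, 2
decay = rng.uniform(0.5, 0.99, size=d_state)
in_proj = rng.normal(size=(d_state, d_in))
out_proj = rng.normal(size=(d_out, d_state))

# Made-up prompt: think of it as 3 demonstrations of 100 steps each.
demos = rng.normal(size=(300, d_in))

# One pass over the full prompt.
full_out, full_state = ssm_scan(demos, decay, in_proj, out_proj)

# Same prompt streamed in two chunks, carrying only the state across.
out_a, state_a = ssm_scan(demos[:200], decay, in_proj, out_proj)
out_b, state_b = ssm_scan(demos[200:], decay, in_proj, out_proj, state=state_a)

streamed_matches = np.allclose(full_out, np.vstack([out_a, out_b]))
```

The work per step is constant and the memory carried between steps is `d_state` numbers regardless of how many demonstrations came before, which is the property that makes long prompts affordable.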
The paper also reinforces a broader trend: SSMs are emerging as a viable alternative to Transformers for sequential decision-making tasks, not just language modeling. While Transformers remain the dominant choice for very large-scale pretraining, SSMs offer a compelling efficiency advantage for real-time or resource-constrained robotics applications.
Implications for AI Practitioners
For robotics engineers and AI researchers, RoboSSM suggests several actionable insights:
- Re-evaluate architecture choices: If your ICIL system is hitting context-length limits, switching to an SSM backbone could yield substantial speed and memory gains without sacrificing task performance.
- Expect trade-offs in expressiveness: SSMs compress the whole prompt into a fixed-size state, so they may not recall specific details from distant demonstrations as precisely as attention does. Practitioners should benchmark both architectures on their specific task distributions before committing.
- Prepare for hybrid systems: The most robust future systems may combine SSMs for efficient processing of long demonstration histories with smaller Transformer modules for fine-grained reasoning.
- Monitor hardware compatibility: SSM implementations such as Mamba rely on GPU-optimized kernels, and their performance on edge devices (e.g., embedded robot controllers) remains an open question.
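The advice above to benchmark both architectures can start from a small timing harness. The sketch below is one possible shape for it, with toy NumPy stand-ins (`toy_attention`, `toy_ssm`) that are hypothetical placeholders, not the paper's models. The attention stand-in is non-causal and has no learned projections. In practice you would replace both functions with your actual backbones and measure on the target hardware.

```python
import time

import numpy as np


def toy_attention(x):
    """Single-head softmax self-attention without projections (toy stand-in)."""
    scores = x @ x.T / np.sqrt(x.shape[1])       # (T, T) score matrix
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x


def toy_ssm(x, decay=0.9):
    """Scalar-decay linear recurrence (toy stand-in for an SSM layer)."""
    state = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        state = decay * state + x[t]
        out[t] = state
    return out


def time_backbone(fn, lengths, dim=32, repeats=3):
    """Return {sequence_length: best wall-clock seconds over `repeats` runs}."""
    rng = np.random.default_rng(0)
    results = {}
    for length in lengths:
        x = rng.normal(size=(length, dim))
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(x)
            best = min(best, time.perf_counter() - start)
        results[length] = best
    return results


lengths = [128, 256, 512]
timings = {
    "attention": time_backbone(toy_attention, lengths),
    "ssm": time_backbone(toy_ssm, lengths),
}
```

Wall-clock results from these stand-ins say little about real models: the Python loop in `toy_ssm` can easily lose to vectorized attention at short lengths even though it scales linearly. That gap between asymptotic cost and measured latency is exactly why measuring on your own task and hardware matters.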
Key Takeaways
- RoboSSM replaces the Transformer with State-Space Models to achieve linear-time in-context imitation learning, enabling robots to process far more demonstrations in a single prompt.
- This work directly addresses the scalability bottleneck of ICIL, making few-shot adaptation more practical for real-world deployment where retraining is not an option.
- Practitioners should consider SSM backbones for robotics tasks that require long context windows, but should validate performance on their specific tasks due to architectural trade-offs.
- The research signals a broader shift toward efficient sequence models in robotics, potentially reducing the computational cost of deploying adaptive robot policies.