Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs
arXiv:2607.02466v1. Abstract: Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from...
The Pretraining Bottleneck in Robot Learning
A new arXiv preprint (2607.02466) tackles a fundamental limitation of Vision-Language-Action (VLA) models: the scarcity of expert demonstrations. The authors argue that current VLA approaches are bottlenecked by the need for costly, task-specific triplets of observations, instructions, and actions. Their proposed solution—task-agnostic pretraining—suggests that robots can learn general movement capabilities before being fine-tuned for specific tasks, analogous to how language models benefit from broad pretraining before instruction tuning.
Why This Matters
The core insight is structural. Current VLA training relies on expert demonstrations for each target task, which becomes expensive at scale: in the simplest setup, a robot that must learn 100 manipulation tasks needs demonstration data covering all 100. The paper's argument, that movement primitives can be learned independently of task semantics, could substantially reduce this data burden, because only the mapping from instructions to movements would still depend on expert data.
If validated, this approach would represent a shift from “learn to do” to “learn to move, then learn to do.” The robot first acquires general motor skills (grasping, pushing, reaching) from diverse, unlabeled interaction data, then maps task instructions onto these pre-trained movement capabilities using far fewer expert demonstrations. This mirrors the successful paradigm in NLP where models pretrain on broad text corpora before task-specific fine-tuning.
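The two-stage idea can be made concrete with a deliberately tiny toy. The sketch below is not the paper's method: it assumes a one-dimensional "robot" whose displacement is a fixed gain times the action, and every name in it (`collect_random_interactions`, `pretrain_inverse_dynamics`, `finetune_instruction_map`, `act`) is hypothetical. Stage 1 estimates the gain from random, instruction-free motion; stage 2 uses three labeled demonstrations to map instructions onto desired displacements.

```python
import random


def collect_random_interactions(n, true_gain=0.5, noise=0.01, seed=0):
    """Task-agnostic data: random actions and the displacement each one caused.

    true_gain is the hidden property of the toy robot that stage 1 must recover.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        action = rng.uniform(-1.0, 1.0)
        displacement = true_gain * action + rng.gauss(0.0, noise)
        data.append((action, displacement))
    return data


def pretrain_inverse_dynamics(data):
    """Stage 1 ("learn to move"): least-squares fit of displacement = gain * action.

    No instructions or task labels are used here.
    """
    numerator = sum(a * d for a, d in data)
    denominator = sum(a * a for a, _ in data)
    return numerator / denominator


def finetune_instruction_map(demos):
    """Stage 2 ("learn to do"): average the demonstrated displacement per instruction."""
    sums, counts = {}, {}
    for instruction, displacement in demos:
        sums[instruction] = sums.get(instruction, 0.0) + displacement
        counts[instruction] = counts.get(instruction, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


def act(instruction, goal_map, gain):
    """Compose both stages: instruction -> desired displacement -> action."""
    return goal_map[instruction] / gain


# Stage 1 uses many cheap, unlabeled interactions.
gain = pretrain_inverse_dynamics(collect_random_interactions(1000))

# Stage 2 uses only a handful of labeled demonstrations (illustrative values).
goal_map = finetune_instruction_map(
    [("push forward", 0.20), ("push forward", 0.22), ("pull back", -0.10)]
)

action = act("push forward", goal_map, gain)
```

The point of the toy is the data split: the expensive labeled set only has to pin down what each instruction means, while everything about how the robot moves comes from the cheap random data.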
Implications for AI Practitioners
For robotics researchers and engineers, several practical considerations emerge:
Data strategy changes. Teams should prioritize collecting diverse, task-agnostic interaction data (e.g., random arm movements, object interactions without specific goals) over task-specific demonstrations. This data is cheaper to collect and can be reused across multiple downstream tasks.
Architecture design choices. The pretraining phase likely requires different model architectures than end-to-end VLA training. Practitioners may need to decouple movement representation from task understanding, potentially using separate encoders or modular networks that can be composed during fine-tuning.
Evaluation metrics must evolve. Current benchmarks measure task completion rates, but task-agnostic pretraining requires metrics that assess movement quality, diversity, and adaptability. Researchers will need new evaluation protocols that measure how well pretrained movement representations transfer to unseen tasks.
Computational trade-offs. Pretraining on diverse movement data may require significant compute, but this upfront cost could be amortized across many downstream tasks. Teams with limited resources might prioritize smaller, carefully curated movement datasets over massive, uncurated ones.
The paper's core thesis, that movement and task understanding can be decoupled in VLA learning, challenges the prevailing paradigm of end-to-end training from expert demonstrations. If successful, this could unlock more sample-efficient robot learning, bringing us closer to generalist robots that can adapt to new tasks with minimal human effort.
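The amortization argument in the compute trade-off above reduces to simple arithmetic. As a rough sketch, assume a one-time pretraining cost plus a per-task fine-tuning cost, compared against a per-task cost of training from scratch. The function name `break_even_tasks` and the cost figures are hypothetical, in arbitrary units, and are not taken from the paper.

```python
import math


def break_even_tasks(pretrain_cost, finetune_cost_per_task, scratch_cost_per_task):
    """Smallest number of tasks n for which
        pretrain_cost + n * finetune_cost_per_task < n * scratch_cost_per_task.

    Returns None when fine-tuning is not cheaper per task than training
    from scratch, since pretraining then never pays for itself.
    """
    saving_per_task = scratch_cost_per_task - finetune_cost_per_task
    if saving_per_task <= 0:
        return None
    # n must be strictly greater than pretrain_cost / saving_per_task.
    return math.floor(pretrain_cost / saving_per_task) + 1


# Illustrative numbers only: 1000 units to pretrain, 10 per fine-tuned task,
# 60 per task trained from scratch.
n_tasks = break_even_tasks(1000, 10, 60)
```

With these made-up costs, pretraining becomes the cheaper route from the 21st task onward. A team expecting only a few downstream tasks would be on the other side of that line.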
Key Takeaways
- Task-agnostic pretraining for VLAs proposes learning general movement capabilities before task-specific fine-tuning, reducing dependence on costly expert demonstrations
- This approach mirrors the NLP paradigm of broad pretraining followed by instruction tuning, potentially enabling more sample-efficient robot learning
- Practitioners should consider collecting diverse, unlabeled interaction data alongside task-specific demonstrations, and may need to adopt modular architectures that separate movement from task understanding
- New evaluation metrics are needed to assess the quality and transferability of pretrained movement representations across diverse downstream tasks
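One simple way to quantify transfer, offered here as an assumption rather than a protocol from the paper, is to compare how many demonstrations a pretrained model and a from-scratch model need to reach the same success rate on a held-out task. The helper names and the learning-curve values below are hypothetical placeholders, not reported results.

```python
def demos_to_threshold(curve, threshold):
    """curve maps a demonstration count to a measured success rate in [0, 1].

    Returns the fewest demonstrations whose success rate meets the threshold,
    or None if no point on the curve reaches it.
    """
    reached = [n for n, rate in curve.items() if rate >= threshold]
    return min(reached) if reached else None


def sample_efficiency_gain(pretrained_curve, scratch_curve, threshold):
    """Ratio of demonstrations needed from scratch to demonstrations needed
    after pretraining. Values above 1 favor pretraining.

    Returns None if either model never reaches the threshold, or if the
    pretrained model reaches it with zero demonstrations (ratio undefined).
    """
    pretrained = demos_to_threshold(pretrained_curve, threshold)
    scratch = demos_to_threshold(scratch_curve, threshold)
    if pretrained is None or scratch is None or pretrained == 0:
        return None
    return scratch / pretrained


# Placeholder curves for a single unseen task (made up for illustration).
pretrained_curve = {10: 0.55, 50: 0.82, 200: 0.90}
scratch_curve = {10: 0.10, 50: 0.40, 200: 0.81}

gain_at_80 = sample_efficiency_gain(pretrained_curve, scratch_curve, 0.80)
```

Averaging this ratio over several held-out tasks would give one number for transferability. It says nothing about movement quality or diversity, which would need separate measures.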