Human2Any: Human-to-Robot Transfer via Constraint-Aware Compositional Planning
arXiv:2606.28813v1 Announce Type: cross Abstract: Human videos are a scalable source of supervision for robot manipulation, as they are abundant and naturally capture rich object interactions. However, transferring human demonstrations to robots remains challenging due to embodiment mismatch, scene...
Bridging the Embodiment Gap: Human2Any and the Promise of Scalable Robot Learning
The arXiv preprint "Human2Any" tackles one of the most persistent bottlenecks in robotic manipulation: the difficulty of transferring the vast wealth of human demonstration data directly to robot platforms. A core problem is the "embodiment mismatch": a human hand has different kinematics, degrees of freedom, and sensory feedback than a robot gripper. Human2Any proposes a constraint-aware compositional planning framework that decomposes human actions into reusable, robot-agnostic primitives, then re-synthesizes them under the physical constraints of the target robot.
What happened
The researchers introduce a method that first extracts task-relevant constraints from human video demonstrations, such as object affordances, contact points, and spatial relationships. These constraints are then composed into a planning graph that is independent of any specific robot morphology. When a robot needs to execute the task, the system maps these abstract constraints onto the robot's own kinematic and dynamic capabilities, generating a feasible motion plan. This contrasts with prior work that either requires expensive teleoperation data or relies on domain randomization that often fails to generalize across drastically different robot arms.
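The extract, compose, and ground steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Constraint`, `Robot`, and `TaskGraph` classes, the reach and gripper-aperture fields, and every numeric value are hypothetical stand-ins chosen only to show how a robot-agnostic constraint graph might be checked against two different embodiments.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Constraint:
    """A task-level requirement, as might be extracted from video (hypothetical schema)."""
    name: str
    kind: str              # e.g. "contact", "spatial", "affordance"
    min_reach_m: float     # how far from the robot base this step happens
    min_aperture_m: float  # gripper opening this step needs (0.0 if none)


@dataclass(frozen=True)
class Robot:
    """Coarse capability summary of a target embodiment (hypothetical)."""
    name: str
    max_reach_m: float
    max_aperture_m: float


@dataclass
class TaskGraph:
    """Robot-agnostic graph: an edge a -> b means step a must precede step b."""
    steps: dict = field(default_factory=dict)       # name -> Constraint
    successors: dict = field(default_factory=dict)  # name -> set of later step names

    def add(self, constraint, after=()):
        self.steps[constraint.name] = constraint
        self.successors.setdefault(constraint.name, set())
        for pred in after:  # predecessors must already be in the graph
            self.successors[pred].add(constraint.name)

    def order(self):
        """Return the steps in a valid execution order (Kahn's algorithm)."""
        indegree = {name: 0 for name in self.steps}
        for succs in self.successors.values():
            for s in succs:
                indegree[s] += 1
        ready = sorted(n for n, d in indegree.items() if d == 0)
        ordered = []
        while ready:
            n = ready.pop(0)
            ordered.append(n)
            for s in sorted(self.successors[n]):
                indegree[s] -= 1
                if indegree[s] == 0:
                    ready.append(s)
        if len(ordered) != len(self.steps):
            raise ValueError("constraint graph contains a cycle")
        return ordered


def ground(graph, robot):
    """Split the ordered steps into those the robot can satisfy and those it cannot."""
    plan, violations = [], []
    for name in graph.order():
        c = graph.steps[name]
        ok = (c.min_reach_m <= robot.max_reach_m
              and c.min_aperture_m <= robot.max_aperture_m)
        (plan if ok else violations).append(name)
    return plan, violations


# One graph, built once, checked against two made-up robots.
graph = TaskGraph()
graph.add(Constraint("grasp_cup", "contact", 0.50, 0.08))
graph.add(Constraint("move_over_bowl", "spatial", 0.70, 0.0), after=["grasp_cup"])
graph.add(Constraint("tilt_cup", "spatial", 0.70, 0.0), after=["move_over_bowl"])

arm_a = Robot("arm_a", max_reach_m=0.85, max_aperture_m=0.085)
arm_b = Robot("arm_b", max_reach_m=0.60, max_aperture_m=0.10)

plan_a, bad_a = ground(graph, arm_a)
plan_b, bad_b = ground(graph, arm_b)
print(plan_a, bad_a)
print(plan_b, bad_b)
```

The point of the sketch is the separation of concerns: the graph is built without reference to any robot, and only `ground` consults embodiment-specific limits. A real system would replace the two scalar checks with inverse kinematics, collision checking, and dynamics limits.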
Why it matters
The implications are significant for the scalability of robot learning. Human videos are abundant: platforms such as YouTube host vast amounts of unlabeled manipulation footage. If Human2Any’s approach proves robust, it could dramatically reduce the need for costly, robot-specific data collection. For AI practitioners, this represents a shift away from training monolithic end-to-end policies that are brittle to embodiment changes, toward a more modular, compositional paradigm. The constraint-aware aspect is particularly critical: by explicitly modeling what must remain invariant (e.g., "the cup must remain upright") versus what can vary (e.g., "the robot can approach from the left or right"), the system aims to stay flexible about how a task is executed while still respecting the conditions that make it succeed safely.
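The invariant-versus-free distinction can be made concrete with a small filter-then-optimize sketch. Everything here is an illustrative assumption rather than the paper's formulation: the `Waypoint` type, the 10-degree tilt tolerance, the candidate trajectories, and the use of path length as the cost are all hypothetical. Invariants act as hard filters, and the free choice (approach side) is resolved by picking the cheapest surviving candidate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Waypoint:
    x: float
    y: float
    tilt_deg: float  # cup tilt away from vertical at this waypoint


MAX_TILT_DEG = 10.0  # assumed tolerance for "the cup must remain upright"


def keeps_upright(traj, max_tilt_deg=MAX_TILT_DEG):
    """Invariant: the cup never tilts beyond the tolerance."""
    return all(abs(w.tilt_deg) <= max_tilt_deg for w in traj)


def path_length(traj):
    """Cost over the free dimensions: planar distance travelled."""
    return sum(math.dist((a.x, a.y), (b.x, b.y)) for a, b in zip(traj, traj[1:]))


def select(candidates, invariants, cost):
    """Drop candidates that break any invariant, then return the cheapest name."""
    feasible = {name: traj for name, traj in candidates.items()
                if all(inv(traj) for inv in invariants)}
    if not feasible:
        return None
    return min(feasible, key=lambda name: cost(feasible[name]))


# Two made-up approaches to the same goal. The right-hand path is shorter
# but tips the cup by 25 degrees midway.
candidates = {
    "left": [Waypoint(0.0, 0.0, 0.0), Waypoint(-0.2, 0.3, 5.0), Waypoint(0.0, 0.6, 2.0)],
    "right": [Waypoint(0.0, 0.0, 0.0), Waypoint(0.1, 0.3, 25.0), Waypoint(0.0, 0.6, 0.0)],
}

print(select(candidates, [keeps_upright], path_length))  # invariant enforced
print(select(candidates, [], path_length))               # invariant ignored
```

With the invariant enforced, the longer left-hand approach is chosen; without it, the shorter right-hand approach wins even though it would spill the cup. Treating invariants as filters rather than cost terms means no amount of path savings can buy a violation.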
Implications for AI practitioners
For those building real-world robot systems, this work suggests three actionable insights. First, investing in constraint extraction from video, rather than direct imitation, may yield higher transferability across different hardware. Second, compositional planning offers a possible path to zero-shot generalization: if a task can be decomposed into primitives, those primitives can in principle be recombined for a novel robot without retraining. Third, the approach may narrow the gap between demonstration and execution by grounding plans in physical constraints rather than pixel-level imitation. Practitioners should watch for follow-up work that evaluates Human2Any on diverse robot platforms (e.g., bimanual arms, mobile manipulators) and under noisy perception conditions, as real-world video is rarely clean.
Key Takeaways
- Human2Any introduces a constraint-aware compositional planning framework that transfers human demonstrations to robots by focusing on invariant task constraints rather than direct imitation.
- The approach could dramatically reduce the need for expensive robot-specific data collection by leveraging abundant human video sources.
- For AI practitioners, the key insight is that decomposing tasks into reusable, embodiment-agnostic primitives enables zero-shot transfer across different robot morphologies.
- Critical next steps include validation on diverse hardware and under real-world perception noise, which will determine the method's practical utility beyond controlled lab settings.