Research · 2026-06-30

GROW$^2$: Grounding Which and Where for Robot Tool Use

Originally published by arXiv cs.AI

arXiv:2606.30632v1 (cross-listed). Abstract: Can the robot use a plate to cut a cake if no knife is available? Tool use greatly expands robot capabilities, but to use tools creatively beyond their intended functions, the robot faces the challenge of $\textit{open-world affordance grounding}$:...

The GROW$^2$ Framework: Bridging Semantic and Spatial Reasoning for Robot Tool Use

A new arXiv paper (arXiv:2606.30632) introduces GROW$^2$, a framework for a fundamental problem in robot tool use: deciding which object can serve a function (its affordance) and where on that object the function is realized, even when the object is used in an unconventional way. The core contribution is a method for "open-world affordance grounding": recognizing, for example, that a plate can stand in for a knife to cut a cake, or that a screwdriver can pry open a lid, without prior training on those specific scenarios.

What Happened

The researchers propose a dual-stream architecture that splits the problem into two linked tasks: semantic grounding (identifying which object has the properties a function requires) and spatial grounding (determining where on that object the function can be applied). For example, when a robot needs to cut a cake without a knife, GROW$^2$ identifies that a rigid object with a thin edge, such as a plate, shares the "cutting" affordance with a knife, then pinpoints the plate's edge as the contact region. The system uses large language models for semantic reasoning about object properties and vision-language models for spatial localization, and combines the two through an attention mechanism.
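The which-then-where split can be illustrated with a toy example. The sketch below is not the paper's method: it replaces the language and vision-language models with hand-written property sets, and every name in it (`REQUIREMENTS`, `Part`, `SceneObject`, `ground_which`, `ground_where`) is hypothetical. It only shows how an object-level filter and a part-level filter fit together.

```python
from dataclasses import dataclass

# Hypothetical requirements per affordance (illustrative, not from the paper):
# object-level material properties and part-level geometry.
REQUIREMENTS = {
    "cut": {"material": {"rigid"}, "geometry": {"thin_edge"}},
}


@dataclass(frozen=True)
class Part:
    name: str
    geometry: frozenset


@dataclass(frozen=True)
class SceneObject:
    name: str
    material: frozenset
    parts: tuple


def ground_which(affordance, scene):
    """'Which' step: keep objects whose material and at least one part qualify."""
    req = REQUIREMENTS[affordance]
    return [
        obj for obj in scene
        if req["material"] <= obj.material
        and any(req["geometry"] <= part.geometry for part in obj.parts)
    ]


def ground_where(affordance, obj):
    """'Where' step: name the parts of one object that have the needed geometry."""
    req = REQUIREMENTS[affordance]
    return [part.name for part in obj.parts if req["geometry"] <= part.geometry]


scene = [
    SceneObject(
        "plate",
        frozenset({"rigid"}),
        (Part("edge", frozenset({"thin_edge"})), Part("base", frozenset({"flat"}))),
    ),
    SceneObject("sponge", frozenset({"soft"}), (Part("side", frozenset({"flat"})),)),
]

for tool in ground_which("cut", scene):
    print(tool.name, ground_where("cut", tool))  # prints: plate ['edge']
```

In a real system the property sets would come from model predictions over images and text rather than from a fixed table; the two-function structure is the only point of the sketch.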

Why It Matters

This work addresses a critical bottleneck in practical robotics: the inability to generalize beyond pre-programmed tool uses. Current robotic systems typically require explicit training for each tool-object pair, limiting their adaptability in unstructured environments like homes or disaster zones. GROW$^2$'s approach is significant because:

Zero-shot generalization: The robot can reason about novel tool uses without task-specific training data, moving closer to human-like flexibility.
Compositional reasoning: By separating "what" from "where," the system can recombine knowledge across contexts. It can understand, for instance, that a book can both hold down papers (weight affordance) and serve as a step stool (height affordance).
Practical robustness: The spatial grounding component reduces errors from purely semantic reasoning, which might know a plate can cut but not where to apply force.
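The compositional point can be made concrete with a small lookup. This is a minimal sketch under assumed names: the `RULES` table, its property labels, and the `affordances` function are invented for illustration and do not come from the paper. Each rule pairs a function with the properties it needs and the region where it would be applied, so a single object can satisfy several rules at once.

```python
# Hypothetical rules (illustrative only): function -> (needed properties, region).
RULES = {
    "hold_down_papers": ({"heavy"}, "bottom face"),
    "step_stool": ({"rigid", "flat_top"}, "top face"),
    "cut": ({"rigid", "thin_edge"}, "edge"),
}


def affordances(properties):
    """Return every function the object supports, with the region to use."""
    props = set(properties)
    return {
        name: region
        for name, (needed, region) in RULES.items()
        if needed <= props
    }


book = {"heavy", "rigid", "flat_top"}
print(affordances(book))
# prints: {'hold_down_papers': 'bottom face', 'step_stool': 'top face'}
```

The same property description yields two different uses with two different contact regions, which is the recombination the list above describes.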

Implications for AI Practitioners

For robotics engineers, this framework suggests a path toward more deployable systems. Because the dual-stream architecture builds on existing foundation models rather than requiring custom training, it should be feasible to run on current hardware. Developers working on household robots or industrial automation should note that GROW$^2$ could reduce the need for exhaustive affordance datasets.

However, the paper likely leaves open questions about failure modes, particularly when semantic reasoning produces plausible-sounding but physically unworkable suggestions (e.g., using a sponge as a hammer). Practitioners will need to implement safety constraints and physical simulation checks alongside the grounding system.
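One way to picture such a check is a filter that runs after semantic ranking. The sketch below is an assumption-laden stand-in: it uses a simple stiffness threshold in place of a real physics simulation, and the `Candidate` fields, the `MIN_STIFFNESS` values, and the `filter_plausible` function are all hypothetical. Tasks with no known threshold are rejected, which is a conservative design choice rather than anything the paper specifies.

```python
from dataclasses import dataclass

# Illustrative thresholds on a normalized 0..1 stiffness estimate (assumed values).
MIN_STIFFNESS = {"hammer": 0.7, "cut": 0.5}


@dataclass(frozen=True)
class Candidate:
    tool: str
    task: str
    semantic_score: float  # how plausible the language model finds the pairing
    stiffness: float       # hypothetical physical estimate, 0 (soft) to 1 (rigid)


def filter_plausible(candidates):
    """Drop candidates that fail the physical check, then rank by semantic score."""
    kept = [
        c for c in candidates
        if c.stiffness >= MIN_STIFFNESS.get(c.task, float("inf"))
    ]
    return sorted(kept, key=lambda c: c.semantic_score, reverse=True)


candidates = [
    Candidate("sponge", "hammer", semantic_score=0.9, stiffness=0.1),
    Candidate("rock", "hammer", semantic_score=0.6, stiffness=0.95),
]
print([c.tool for c in filter_plausible(candidates)])  # prints: ['rock']
```

Here the sponge scores higher semantically but is removed by the physical check, so the lower-scoring rock is the only candidate that survives.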

For AI researchers, this work reinforces the value of combining large language models with spatial reasoning. The architecture offers a template for other "grounding" problems where both semantic understanding and physical localization are required, such as in autonomous driving or surgical robotics.

Key Takeaways

GROW$^2$ enables robots to use tools creatively by separating what an object can do (semantic affordance) from where to apply it (spatial grounding).
The framework targets zero-shot generalization to novel tool uses, reducing the need for task-specific training data.
Practical implementation requires combining foundation models with physical safety checks to rule out plausible-sounding but physically unworkable actions.
The dual-stream architecture offers a reusable template for other AI systems that need both semantic reasoning and spatial localization.
