Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition
arXiv:2601.16211v3. Abstract: Zero-Shot Compositional Action Recognition (ZS-CAR) requires recognizing novel verb-object combinations composed of previously observed primitives. In this work, we tackle a key failure mode: models predict verbs via object-driven shortcuts...
The Hidden Trap in Compositional Action Recognition
A recent arXiv paper tackles a subtle but critical failure in how AI models understand actions: the tendency to cheat by using object cues rather than genuinely recognizing verb-object compositions. The researchers address Zero-Shot Compositional Action Recognition (ZS-CAR), where models must recognize new combinations of verbs and objects, such as “opening a drawer” when they’ve only seen “opening a door” and “closing a drawer” before.
The core problem is deceptively simple. When a model trains on clips labeled “closing a drawer,” it doesn’t necessarily learn what “closing” looks like. It can instead learn which verbs tend to accompany drawers and then predict the verb from the object alone. Shown a drawer being opened, a combination it has never seen, such a model may still answer “closing.” The paper’s title, “Why Can’t I Open My Drawer?,” captures this absurdity: the model fails to generalize because it never truly understood the action in the first place.
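The shortcut can be illustrated with a toy predictor that never looks at motion at all. The sketch below is not the paper's method; the label pairs and the function names `fit_object_prior` and `predict_verb_from_object` are hypothetical, chosen only to show how a verb can be guessed from object identity using training co-occurrence counts.

```python
from collections import Counter, defaultdict

# Hypothetical toy training labels: (verb, object) pairs seen during training.
train_pairs = [
    ("open", "door"), ("open", "door"),
    ("close", "drawer"), ("close", "drawer"), ("close", "drawer"),
]

def fit_object_prior(pairs):
    """Count how often each verb co-occurs with each object in training."""
    counts = defaultdict(Counter)
    for verb, obj in pairs:
        counts[obj][verb] += 1
    return counts

def predict_verb_from_object(counts, obj):
    """Shortcut predictor: most frequent training verb for this object.

    Returns None when the object never appeared in training.
    """
    top = counts[obj].most_common(1) if obj in counts else []
    return top[0][0] if top else None

prior = fit_object_prior(train_pairs)

# Unseen composition at test time: the true label is ("open", "drawer").
shortcut_verb = predict_verb_from_object(prior, "drawer")
print(shortcut_verb)  # prints "close", because drawers only co-occurred with "close"
```

A real video model is far more complex, but if its verb head leans on object features in this way, it will make the same kind of error on the unseen pair.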
Why This Matters Beyond Academic Benchmarks
This isn’t just an academic curiosity. Object-driven shortcuts represent a fundamental limitation in how vision-language models learn compositional concepts. In real-world applications, this failure mode has serious consequences:
- Robotics: A robot trained to “pick up a cup” might fail when asked to “pick up a book” if it learned “pick up” as a property of cups, not as a transferable verb.
- Surveillance systems: Action recognition models that rely on object shortcuts will fail in novel environments where object-action pairings differ from training data.
- Assistive technologies: Systems designed to understand human activities from video will struggle with compositional generalization—a core requirement for real-world utility.
Implications for AI Practitioners
For those building or deploying action recognition systems, this research highlights several practical considerations:
First, standard evaluation metrics often mask shortcut learning. A model might achieve high accuracy on held-out test sets simply because object-verb correlations persist. Practitioners need to design evaluation splits that explicitly break these correlations.
Second, the problem is structural, not just about more data. Adding more training examples of “opening drawers” won’t fix the underlying shortcut; it may even reinforce it. The solution requires architectural or training interventions that force the model to attend to motion and interaction dynamics, not just static object features.
Third, this work connects to a broader trend in AI safety and robustness: models that rely on spurious correlations are brittle under distribution shift. The same shortcut principle applies to other domains, from medical imaging (learning hospital equipment instead of disease markers) to natural language processing (learning dataset artifacts instead of semantic understanding).
Key Takeaways
- Object-driven shortcuts are a major failure mode in compositional action recognition, where models predict verbs based on object identity rather than genuine action understanding, limiting zero-shot generalization.
- Standard benchmarks may overestimate model capability by failing to control for spurious correlations between objects and actions in evaluation data.
- Mitigation requires targeted interventions—such as decorrelating object and verb representations in training—rather than simply scaling up data or model size.
- This problem extends beyond action recognition to any compositional task where models can exploit statistical shortcuts instead of learning true compositional representations.
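To make the point about evaluation concrete, here is a minimal sketch of two sanity checks a practitioner might run on a train/test split. It assumes labels are available as (verb, object) pairs. The helper names `check_zero_shot_split` and `shortcut_solvable_rate` are hypothetical and do not come from the paper or from any benchmark toolkit. The first check confirms that test pairs are unseen while their primitives are seen. The second estimates how much of the test set an object-only guesser could solve.

```python
from collections import Counter, defaultdict

def check_zero_shot_split(train_pairs, test_pairs):
    """Report test pairs that leak from training or use unseen primitives."""
    train_set = set(train_pairs)
    train_verbs = {verb for verb, _ in train_set}
    train_objs = {obj for _, obj in train_set}
    leaked = [p for p in test_pairs if p in train_set]
    unknown = [p for p in test_pairs
               if p[0] not in train_verbs or p[1] not in train_objs]
    return {"leaked_pairs": leaked, "unknown_primitives": unknown}

def shortcut_solvable_rate(train_pairs, test_pairs):
    """Fraction of test samples whose verb equals the object's top training verb."""
    by_object = defaultdict(Counter)
    for verb, obj in train_pairs:
        by_object[obj][verb] += 1
    hits = 0
    for verb, obj in test_pairs:
        top = by_object[obj].most_common(1) if obj in by_object else []
        if top and top[0][0] == verb:
            hits += 1
    return hits / len(test_pairs) if test_pairs else 0.0

# Hypothetical toy labels for illustration only.
train = [("open", "door"), ("close", "door"), ("close", "drawer")]
test_compositional = [("open", "drawer")]   # unseen pair, seen primitives
test_in_distribution = [("close", "drawer")]  # pair already seen in training

print(check_zero_shot_split(train, test_compositional))
print(shortcut_solvable_rate(train, test_compositional))
print(shortcut_solvable_rate(train, test_in_distribution))
```

A high shortcut-solvable rate suggests that accuracy on that split says little about whether the model recognizes the action itself.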