Research 2026-07-02

Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval

Originally published by Arxiv CS.AI

arXiv:2607.00374v1 Announce Type: cross Abstract: Composed Image Retrieval (CIR) retrieves a target image from a reference image and a textual modification. While supervised CIR relies on costly triplets, Zero-Shot CIR (ZS-CIR) alleviates this reliance through proxy tasks trained on image-text...

Rethinking How AI Learns to “Edit” Images with Language

A new paper on arXiv revisits a foundational challenge in multimodal AI: how to retrieve an image that matches a reference image plus a textual instruction, without expensive human-annotated training data. The work, “Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval,” tackles the zero-shot setting of Composed Image Retrieval (CIR), where models must generalize from proxy tasks rather than supervised triplets.

The core problem is simple to state but hard to solve: given a picture of a red dress and the query “make it blue and add sleeves,” the system must find the matching target image. Supervised approaches require thousands of (reference, text, target) triplets, which are costly to collect. Zero-shot methods instead train on proxy tasks built from readily available image-text pairs, for example learning to reconstruct captions or match images to descriptions. The paper systematically investigates which proxy task designs actually transfer to CIR, reporting that many common assumptions about what makes a good proxy lead to suboptimal designs.
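To make the retrieval setup concrete, here is a minimal sketch of composed retrieval over precomputed embeddings. It is not the paper's method: the weighted-sum fusion in `compose_query`, the `alpha` parameter, and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the embeddings would come from a vision-language encoder.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def compose_query(ref_image_emb, text_emb, alpha=0.5):
    """Hypothetical fusion: weighted sum of the two unit-length embeddings."""
    fused = alpha * l2_normalize(ref_image_emb) + (1.0 - alpha) * l2_normalize(text_emb)
    return l2_normalize(fused)


def retrieve(query_emb, gallery_embs, k=1):
    """Return indices of the k gallery items most cosine-similar to the query."""
    sims = l2_normalize(gallery_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)[:k]


# Toy embeddings (made up): axis 0 ~ "dress", axis 1 ~ "blue with sleeves".
reference = [1.0, 0.0, 0.0]          # the red dress
modification = [0.0, 1.0, 0.0]       # "make it blue and add sleeves"
gallery = [
    [1.0, 0.0, 0.0],                 # 0: the reference image itself
    [1.0, 1.0, 0.0],                 # 1: same dress with the requested edit
    [0.0, 0.0, 1.0],                 # 2: unrelated image
    [0.0, 1.0, 0.0],                 # 3: matches the text but not the dress
]

query = compose_query(reference, modification)
best = retrieve(query, gallery, k=1)
```

With these toy vectors, item 1 ranks first because it is the only gallery entry close to both the reference and the text; items 0 and 3 each match only one of the two.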

Why This Matters for Multimodal AI

This research addresses a bottleneck that limits practical deployment of CIR systems. E-commerce, design tools, and visual search engines all benefit from the ability to modify visual queries with natural language. Currently, most production systems either require massive curated datasets or rely on brittle heuristics. By rigorously analyzing proxy task design, the paper provides a roadmap for building more sample-efficient models.

The key insight is that proxy tasks differ sharply in how well they transfer. Tasks that force the model to reason about compositional changes, where the text modifies specific attributes of the image, generalize far better than tasks that only optimize global image-caption similarity. This aligns with broader findings in vision-language learning: models need to understand which parts of an image change when a description is altered, not just whether the description is globally relevant.
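The contrast between the two kinds of objective can be sketched with a single contrastive (InfoNCE-style) loss applied to different query/target pairings. This is an illustration under assumptions, not the paper's training objective: `info_nce`, `fuse`, the temperature value, and the availability of an edited-target embedding per sample are all hypothetical here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def info_nce(queries, targets, temperature=0.07):
    """Mean cross-entropy of matching row i of queries to row i of targets."""
    logits = l2_normalize(queries) @ l2_normalize(targets).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def fuse(image_embs, text_embs):
    """Hypothetical composition: sum of unit-length embeddings, renormalized."""
    return l2_normalize(l2_normalize(image_embs) + l2_normalize(text_embs))


def global_alignment_loss(image_embs, caption_embs):
    """Global objective: each image should match its own full caption."""
    return info_nce(image_embs, caption_embs)


def compositional_loss(ref_image_embs, edit_text_embs, target_embs):
    """Compositional objective: image + edit text should match the edited target."""
    return info_nce(fuse(ref_image_embs, edit_text_embs), target_embs)
```

The only difference between the two losses is what goes in as the query: a bare image embedding in the global case, and an image-plus-edit composition in the compositional case, which is the input shape CIR uses at test time.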

Implications for AI Practitioners

For engineers building retrieval systems, this work offers practical guidance. First, it suggests that investing in proxy task design, particularly tasks that require local attribute grounding, can reduce or eliminate the need for expensive triplet annotation. Second, it highlights the importance of evaluation protocols: many zero-shot CIR methods report strong numbers on standard benchmarks but fail on compositional edits that require precise attribute changes. Practitioners should test their models on such edge cases before deployment.
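One way to act on this is to report retrieval metrics separately for the harder compositional queries instead of a single aggregate number. The sketch below computes Recall@K, a metric commonly reported for CIR, overall and on a flagged subset. The function name, the `is_multi_attribute` flag, and the similarity scores are made up for illustration.

```python
import numpy as np


def recall_at_k(sims, target_idx, k):
    """Fraction of queries whose target is among the k highest-scoring gallery items.

    sims: (num_queries, gallery_size) similarity scores.
    target_idx: (num_queries,) index of the correct gallery item per query.
    """
    sims = np.asarray(sims, dtype=float)
    target_idx = np.asarray(target_idx)
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == target_idx[:, None]).any(axis=1)
    return float(hits.mean())


# Made-up scores for 4 queries over a 4-image gallery.
sims = np.array([
    [0.9, 0.1, 0.0, 0.2],   # target 0 ranked 1st
    [0.2, 0.8, 0.1, 0.3],   # target 1 ranked 1st
    [0.7, 0.1, 0.5, 0.2],   # target 2 ranked 2nd
    [0.6, 0.5, 0.4, 0.1],   # target 3 ranked 4th
])
targets = np.array([0, 1, 2, 3])
# Hypothetical flag: True where the edit changes several attributes at once.
is_multi_attribute = np.array([False, False, True, True])

overall_r1 = recall_at_k(sims, targets, k=1)
hard_r1 = recall_at_k(sims[is_multi_attribute], targets[is_multi_attribute], k=1)
easy_r1 = recall_at_k(sims[~is_multi_attribute], targets[~is_multi_attribute], k=1)
```

In this toy data the aggregate Recall@1 of 0.5 hides a split of 1.0 on the unflagged queries and 0.0 on the flagged multi-attribute ones.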

The paper also underscores a broader lesson: as multimodal models grow larger, the choice of training objective matters as much as scale. A well-designed proxy task on a moderate-sized model can outperform a poorly designed one on a much larger model. This is encouraging for teams with limited compute, as it shifts the focus from brute-force scaling to smarter data and objective design.

Key Takeaways

Zero-shot CIR can match supervised performance if proxy tasks are designed to teach compositional reasoning about attribute changes, not just global image-text alignment.
Practitioners should prioritize proxy tasks that require local grounding (e.g., “change color of object X”) over generic caption-matching objectives.
Standard benchmarks may overestimate real-world performance; models should be stress-tested on compositional edits that involve multiple attribute modifications.
The paper provides a framework for systematically evaluating proxy task designs, offering a practical toolkit for teams building multimodal retrieval systems without expensive annotations.
