BeClaude
Research · 2026-07-01

Thinking Before Retrieving: Robust Zero-Shot Composed Image Retrieval via Strategic Planning and Self-Criticism

Originally published by arXiv cs.AI

arXiv:2606.31222v1. Abstract: Composed image retrieval requires identifying a target image from a gallery by integrating a reference image with a textual modification instruction. In a training-free zero-shot setting, this task relies on constructing a retrieval-oriented textual...

What Happened

Researchers have introduced a novel zero-shot approach to composed image retrieval (CIR) that incorporates structured planning and self-criticism before executing the retrieval step. The method, described in the arXiv paper "Thinking Before Retrieving," addresses a fundamental challenge in CIR: how to combine a reference image with a textual modification instruction to find a target image, without requiring any training data for the specific task.

The key innovation is a two-stage reasoning process. First, the system generates a retrieval plan that explicitly outlines what visual features to preserve from the reference image and what changes to apply based on the text instruction. Second, it employs a self-criticism mechanism that evaluates the plan's coherence and completeness before executing the actual retrieval. This contrasts with existing zero-shot approaches that directly convert the composed query into a single text or embedding vector, often losing critical multimodal dependencies.
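The plan-then-criticize loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `RetrievalPlan` structure, the `planner` and `critic` callables, and the `max_rounds` limit are all hypothetical names introduced here, and the toy functions stand in for what would be LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RetrievalPlan:
    """Hypothetical plan format: what to keep and what to change."""
    preserve: List[str]
    modify: List[str]
    notes: List[str] = field(default_factory=list)

    def to_query(self) -> str:
        # Flatten the plan into one text query for a text-to-image retriever.
        return ", ".join(self.modify + self.preserve)


# Assumed signatures for the two reasoning steps (not from the paper).
Planner = Callable[[str, str, List[str]], RetrievalPlan]  # caption, instruction, feedback
Critic = Callable[[RetrievalPlan, str], List[str]]        # returns a list of problems


def plan_with_self_criticism(
    caption: str,
    instruction: str,
    planner: Planner,
    critic: Critic,
    max_rounds: int = 2,
) -> RetrievalPlan:
    """Draft a plan, then revise it until the critic finds no problems."""
    feedback: List[str] = []
    plan = planner(caption, instruction, feedback)
    for _ in range(max_rounds):
        problems = critic(plan, instruction)
        if not problems:
            break  # plan judged coherent and complete; ready for retrieval
        feedback = problems
        plan = planner(caption, instruction, feedback)
    return plan


def toy_planner(caption: str, instruction: str, feedback: List[str]) -> RetrievalPlan:
    # Stand-in for an LLM call: the first draft forgets the modification,
    # and the revision (triggered by feedback) includes it.
    modify = [instruction] if feedback else []
    return RetrievalPlan(preserve=[caption], modify=modify, notes=list(feedback))


def toy_critic(plan: RetrievalPlan, instruction: str) -> List[str]:
    # Stand-in for an LLM call: flag plans that ignore the instruction.
    if instruction not in plan.modify:
        return [f"plan does not apply the instruction: {instruction!r}"]
    return []


plan = plan_with_self_criticism(
    "a wooden armchair", "red leather upholstery", toy_planner, toy_critic
)
print(plan.to_query())
```

Only the final, vetted plan is turned into a query, so a flawed first draft never reaches the retrieval step. Bounding the loop with `max_rounds` is a design choice made here to cap the number of model calls.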

Why It Matters

This work is significant for several reasons. First, it tackles the practical reality that most real-world CIR applications cannot afford task-specific training data. Users want to search image galleries with queries like "find this chair but in red leather" without requiring a system pre-trained on thousands of chair-modification examples.

Second, the planning-plus-criticism architecture mirrors how humans approach visual search tasks: we don't blindly combine image and text, we reason about what to keep and what to change. By making this reasoning explicit, the method improves retrieval accuracy while also providing interpretability. A user can inspect the generated plan to see why the system chose certain images.

Third, the approach demonstrates that large language models can serve as effective reasoning engines for multimodal tasks even when they lack direct training on the target domain. This suggests that strategic prompting and self-verification can partially compensate for the absence of task-specific fine-tuning.

Implications for AI Practitioners

For engineers building retrieval systems, this work offers a template for integrating LLMs into existing image search pipelines without retraining. The planning step can be implemented as a structured prompt to an LLM, and the self-criticism step as a separate verification call. This modularity means practitioners can swap in different LLMs or image encoders as they evolve.
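As one possible shape for this, the planning prompt and the verification call can be written as two small functions that share a single `llm` callable, so swapping the model means swapping one argument. Everything below is an assumption made for illustration: the prompt wording, the JSON fields (`preserve`, `modify`, `target`, `ok`, `issues`), and the `fake_llm` stub, which returns canned text so the sketch runs offline.

```python
import json
from typing import Callable, Dict

# Hypothetical interface: prompt text in, raw model text out.
LLM = Callable[[str], str]

PLAN_PROMPT = (
    "Reference image description: {caption}\n"
    "Modification instruction: {instruction}\n"
    'Reply with JSON: {{"preserve": [...], "modify": [...], "target": "..."}}'
)

CRITIC_PROMPT = (
    "Instruction: {instruction}\n"
    "Proposed plan: {plan}\n"
    'Reply with JSON: {{"ok": true or false, "issues": [...]}}'
)


def make_plan(llm: LLM, caption: str, instruction: str) -> Dict:
    """Planning step: one structured prompt, parsed and validated."""
    raw = llm(PLAN_PROMPT.format(caption=caption, instruction=instruction))
    plan = json.loads(raw)
    missing = {"preserve", "modify", "target"} - plan.keys()
    if missing:
        raise ValueError(f"plan is missing fields: {sorted(missing)}")
    return plan


def verify_plan(llm: LLM, plan: Dict, instruction: str) -> Dict:
    """Self-criticism step: a separate call that reviews the plan."""
    raw = llm(CRITIC_PROMPT.format(instruction=instruction, plan=json.dumps(plan)))
    verdict = json.loads(raw)
    return {"ok": bool(verdict.get("ok")), "issues": list(verdict.get("issues", []))}


def fake_llm(prompt: str) -> str:
    # Canned responses in place of a real model call.
    if prompt.startswith("Reference image description"):
        return json.dumps({
            "preserve": ["chair shape"],
            "modify": ["red leather"],
            "target": "a red leather chair with the same shape",
        })
    return json.dumps({"ok": True, "issues": []})


plan = make_plan(fake_llm, "a fabric armchair", "make it red leather")
verdict = verify_plan(fake_llm, plan, "make it red leather")
print(plan["target"], verdict)
```

In a pipeline built this way, `plan["target"]` would be the text passed to whatever text-to-image encoder the system already uses, which keeps the encoder equally replaceable.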

However, the approach has clear trade-offs. The two-stage reasoning introduces latency, because each query requires multiple LLM calls before retrieval begins. For real-time applications, this may be prohibitive. Additionally, the method's reliance on an LLM's reasoning capabilities means performance will vary with model quality; smaller or less capable models may produce flawed plans that degrade retrieval.

Practitioners should also consider the computational cost. Although the method is training-free, its inference cost is higher than that of direct embedding approaches. For high-throughput systems, a hybrid strategy might be optimal: use the planning approach for complex queries and a faster baseline for simple modifications.
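A hybrid strategy of this kind could be sketched as a small router in front of two retrievers. The complexity heuristic below (a word-count threshold plus a few connective markers) is purely an assumption for illustration and would need tuning against real queries; the `fast` and `planned` retrievers are hypothetical stubs.

```python
from typing import Callable, List

# Hypothetical interface: (reference image id, instruction) -> ranked image ids.
Retriever = Callable[[str, str], List[str]]

# Assumed markers of multi-part instructions; not taken from the paper.
COMPLEX_MARKERS = (" and ", " but ", " instead of ", " without ", ",")


def is_complex(instruction: str, max_simple_words: int = 4) -> bool:
    """Rough heuristic: long or multi-clause instructions count as complex."""
    text = f" {instruction.lower().strip()} "
    if len(text.split()) > max_simple_words:
        return True
    return any(marker in text for marker in COMPLEX_MARKERS)


def route(reference_id: str, instruction: str,
          fast: Retriever, planned: Retriever) -> List[str]:
    """Send complex queries to the planning pipeline, simple ones to the baseline."""
    retriever = planned if is_complex(instruction) else fast
    return retriever(reference_id, instruction)


def fast(reference_id: str, instruction: str) -> List[str]:
    # Stub for a direct-embedding baseline.
    return [f"fast:{reference_id}"]


def planned(reference_id: str, instruction: str) -> List[str]:
    # Stub for the plan-and-criticize pipeline.
    return [f"planned:{reference_id}"]


print(route("img_1", "make it red", fast, planned))
print(route("img_1", "same chair but in red leather, without armrests", fast, planned))
```

Logging which branch each query takes would make it possible to check whether the slower path earns its extra latency on the queries routed to it.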

Key Takeaways

  • The "thinking before retrieving" paradigm improves zero-shot composed image retrieval by explicitly planning the modification and self-criticizing the plan before executing the search.
  • This approach enables training-free deployment, making it practical for applications where task-specific data is unavailable or expensive to collect.
  • The method trades inference speed for accuracy and interpretability, requiring multiple LLM calls per query, which may limit real-time use cases.
  • Practitioners can adopt this as a modular component in existing retrieval pipelines, but should benchmark against simpler baselines to determine if the accuracy gains justify the added latency and cost.