When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents
arXiv:2606.23937v1 Announce Type: cross Abstract: Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B classifiers. Under...
When Retrieval Metrics Mislead
A new arXiv preprint (arXiv:2606.23937) challenges a common assumption in retrieval-augmented generation (RAG) systems: that exact-match recall of relevant documents reliably indicates whether the retriever is supplying useful policy context to a downstream decision model. The researchers tested this proxy by evaluating Qwen2.5-3B and Qwen2.5-7B classifiers on pre-action policy classification in tau-bench, an environment that simulates long-horizon tool-use scenarios.
The core finding is that standard retrieval metrics, recall in particular, can be poor proxies for what the authors call "policy signal." A retriever might surface the correct document, yet the downstream model may still fail to extract or act on the policy information it contains. This disconnect becomes especially pronounced in long-horizon tasks, where agents must integrate multiple pieces of context over extended sequences of tool calls.
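To make the gap concrete, here is a minimal Python sketch that scores the same examples two ways: exact-match recall@k for the retriever, and accuracy of the downstream policy decision. The `Example` record, the document ids, and the labels are all hypothetical and are not taken from the paper or from tau-bench; they only illustrate how the two numbers can diverge.

```python
from dataclasses import dataclass


@dataclass
class Example:
    gold_doc: str          # id of the policy document that governs this step (hypothetical)
    retrieved: list        # document ids returned by the retriever, best first
    gold_label: str        # correct pre-action policy decision
    predicted_label: str   # decision produced by the downstream classifier


def recall_at_k(examples, k):
    """Fraction of examples whose gold document appears in the top-k results."""
    hits = sum(ex.gold_doc in ex.retrieved[:k] for ex in examples)
    return hits / len(examples)


def policy_accuracy(examples):
    """Fraction of examples where the classifier's decision matches the gold label."""
    correct = sum(ex.predicted_label == ex.gold_label for ex in examples)
    return correct / len(examples)


def accuracy_given_hit(examples, k):
    """Policy accuracy restricted to examples where retrieval succeeded."""
    hit = [ex for ex in examples if ex.gold_doc in ex.retrieved[:k]]
    return policy_accuracy(hit) if hit else float("nan")


# Toy data, invented for illustration only.
examples = [
    Example("refund_policy", ["refund_policy", "faq"], "deny", "allow"),
    Example("refund_policy", ["refund_policy", "shipping"], "allow", "allow"),
    Example("cancel_policy", ["cancel_policy", "faq"], "deny", "allow"),
    Example("cancel_policy", ["faq", "shipping"], "allow", "allow"),
]

print("recall@2:", recall_at_k(examples, 2))
print("policy accuracy:", policy_accuracy(examples))
print("accuracy given retrieval hit:", accuracy_given_hit(examples, 2))
```

In this toy data the retriever reaches 0.75 recall@2, yet the classifier is correct on only one of the three examples where the gold document was actually retrieved. Reporting accuracy conditioned on a retrieval hit is one simple way to separate retriever failures from failures to use what was retrieved.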
Why This Matters
For AI practitioners, this work exposes a dangerous blind spot in current evaluation practices. Many production systems optimize retrieval pipelines based on recall@k or mean reciprocal rank, implicitly assuming that better retrieval automatically yields better downstream performance. This paper suggests that assumption is flawed in at least two ways:
First, the relationship between retrieval accuracy and agent behavior is non-linear. A retriever achieving 95% recall may still cause policy failures if the 5% of missed documents are critical for specific decision points. Second, even when correct documents are retrieved, the agent's ability to parse and apply them varies with model size, context window utilization, and task complexity.
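The 95% case can be worked through numerically. The sketch below assumes 100 required documents, of which 10 are tagged decision-critical, and a retriever that misses 5 documents that all fall in the critical set. The counts, the weight of 10, and the `recall` helper are illustrative assumptions, not figures or methods from the paper.

```python
def recall(required, retrieved, weights=None):
    """Weighted recall: share of the required documents' total weight that was retrieved.

    Documents absent from `weights` count with weight 1.0, so calling this
    without weights gives ordinary recall.
    """
    weights = weights or {}
    total = sum(weights.get(doc, 1.0) for doc in required)
    found = sum(weights.get(doc, 1.0) for doc in required if doc in retrieved)
    return found / total


# Hypothetical setup: 100 required documents, the first 10 are decision-critical.
required = [f"doc{i}" for i in range(100)]
critical = set(required[:10])

# Assume the retriever misses 5 documents, all of them critical.
missed = set(required[:5])
retrieved = set(required) - missed

plain = recall(required, retrieved)
critical_recall = recall([d for d in required if d in critical], retrieved)

# Assumed weighting: a critical document counts 10x as much as an ordinary one.
weights = {doc: 10.0 for doc in critical}
weighted = recall(required, retrieved, weights)

print(f"plain recall:    {plain:.3f}")
print(f"critical recall: {critical_recall:.3f}")
print(f"weighted recall: {weighted:.3f}")
```

Under these assumptions the headline recall stays at 0.95 while recall over the critical subset is 0.50 and the weighted figure is about 0.74. How to choose the critical set or the weights is left open here; the point is only that an aggregate recall number can hide where the misses fall.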
The tau-bench environment is particularly relevant because it mimics real-world scenarios where agents must follow complex, evolving policies, such as customer support workflows, compliance checks, or multi-step data processing pipelines. In these settings, a single missed or misapplied policy can cascade into significant errors.
Implications for AI Practitioners
Rethink evaluation metrics. Teams should supplement retrieval metrics with direct measurements of downstream task performance, especially for safety-critical or policy-driven applications. A dashboard showing 99% recall may conceal systematic failures in how agents apply retrieved information.
Invest in policy-aware retrieval. Rather than optimizing for generic relevance, retrieval systems may need to be tuned for "decision-critical" documents, those that directly influence agent actions at specific junctures. This could involve weighting documents by their impact on downstream behavior rather than their lexical or semantic similarity to queries.
Test across model scales. The paper's use of both 3B and 7B classifiers highlights that model size affects how well retrieved information is utilized. Smaller models may require more explicit signal in retrieved documents, while larger models might benefit from richer context even if retrieval precision is lower.
Key Takeaways
- Standard retrieval metrics like exact-match recall can significantly overestimate how well an agent will actually use retrieved policy information for decision-making.
- The gap between retrieval accuracy and downstream task performance grows in long-horizon, multi-step tool-use scenarios where context must be integrated over time.
- Practitioners should directly measure agent behavior on policy-critical tasks rather than relying solely on retrieval benchmarks.
- Model size and architecture influence how effectively retrieved information is applied, suggesting that retrieval systems should be co-optimized with the downstream agent rather than evaluated in isolation.