Research2026-06-26

MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG

arXiv:2606.26793v1 Announce Type: cross Abstract: Multimodal agentic retrieval-augmented generation (RAG) systems expand the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks, and orchestrator-level tool manipulation. Existing red-teaming...

Multimodal agentic RAG systems represent a significant leap beyond simple chat interfaces, but with that power comes a vastly expanded attack surface. A new preprint, "MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG," directly confronts this vulnerability, proposing a sophisticated automated red-teaming framework designed to find weaknesses before bad actors do.

The core problem is that agentic RAG systems do more than retrieve text. They process images, execute tool calls, and follow multi-step orchestration plans. This creates novel attack vectors: an attacker could poison a knowledge base with a malicious image that, when retrieved, alters the agent’s behavior; they could craft a direct query that manipulates which tool the orchestrator selects; or they could inject hidden instructions into a retrieved document. Traditional red-teaming, which often relies on manual effort or simple perturbation, cannot scale to cover this combinatorial complexity.

MIRROR addresses this by combining Monte Carlo Tree Search (MCTS) with a novelty-constrained memory mechanism. The MCTS component systematically explores the space of possible attack sequences—mixing text, images, and tool instructions—to find the most effective adversarial inputs. Crucially, the "novelty constraint" prevents the framework from repeatedly generating similar attacks, forcing it to discover diverse vulnerabilities. The memory component stores previously successful attack patterns, allowing the system to guide future searches more efficiently, much like a human red-teamer learns from past exploits.

Why it matters for AI practitioners. This research signals that the era of treating RAG security as a simple "prompt injection" problem is over. For teams building production agentic systems, the implications are immediate. First, you cannot rely solely on input sanitization or output filtering; the attack surface is too distributed across modalities and tool calls. Second, automated red-teaming is becoming a necessity, not a luxury. MIRROR provides a blueprint for how to systematically stress-test these systems, but it also implies that your evaluation pipeline must now include multimodal adversarial testing.

For Claude and other frontier model providers, this work underscores the need for robust, layered defenses. It suggests that future safety benchmarks must include agentic scenarios—where the model acts on retrieved data—rather than just static question-answering. Practitioners should expect that their RAG pipelines will be probed across every modality and tool boundary, and that traditional "jailbreak" lists will be insufficient.

Key Takeaways

Expanded threat model: Multimodal agentic RAG systems face unique attacks (image injection, tool manipulation, orchestrator-level exploits) that go far beyond standard prompt injection.
Automated red-teaming is essential: Manual testing cannot cover the combinatorial attack space; frameworks like MIRROR using MCTS and novelty constraints offer a scalable solution.
Defense must be multimodal: Security strategies must account for adversarial inputs across text, images, and tool-calling sequences, not just natural language prompts.
Production systems need new evaluation: Teams should incorporate agentic red-teaming into their CI/CD pipelines and safety benchmarks to catch vulnerabilities before deployment.

Read Original Article on Arxiv CS.AI

arxivpapersagentsrag