From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching
arXiv:2601.23088v2 Announce Type: replace-cross Abstract: Semantic caching has emerged as a pivotal technique for scaling LLM applications, widely adopted by major providers including AWS and Microsoft. By utilizing semantic embedding vectors as cache keys, this mechanism effectively minimizes...
The Hidden Cost of Efficiency: Semantic Caching Under Attack
A recent paper on arXiv (2601.23088v2) describes an underexplored vulnerability in semantic caching, a technique the abstract says is widely adopted by major providers including AWS and Microsoft. The attack, termed a "Key Collision Attack," exploits the very mechanism that makes semantic caching efficient: using embedding vectors as cache keys rather than exact string matches.
Semantic caching stores LLM responses keyed by the semantic meaning of a query rather than its literal text. When one user asks "What's the capital of France?" and another asks "Name the capital city of France," the two queries produce nearby embedding vectors, so the system treats them as equivalent and serves the cached response. Each cache hit skips a model call, which cuts latency and compute cost for frequently repeated queries.
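As a rough illustration of the lookup described above, here is a minimal sketch of a semantic cache. Everything in it is an assumption made for the example: `embed` is a toy bag-of-words stand-in for a real embedding model, and `SemanticCache` and its `threshold` of 0.6 are hypothetical names and values, not any provider's implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    words = text.lower().replace("?", "").replace("'s", "").split()
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache keyed by embedding similarity, not exact text."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (embedding, response) pairs

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        """Return the best-matching cached response, or None on a miss."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.6)
cache.put("What's the capital of France?", "Paris")
print(cache.get("Name the capital city of France"))  # Paris (similarity is about 0.73)
print(cache.get("How do I bake bread?"))             # None (no shared words)
```

The threshold is the key design choice: set it too high and paraphrases miss the cache, set it too low and unrelated queries start to match.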
What the Research Reveals
The paper demonstrates that attackers can craft inputs whose embedding vectors land close to those of cached entries for unrelated queries. Because the cache treats closeness as equivalence, such a collision can let an attacker:
- Retrieve responses meant for other users (data leakage)
- Corrupt the cache with malicious responses that get served to legitimate users
- Exhaust cache resources by forcing mass collisions
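To see why a similarity threshold admits collisions at all, consider the toy sketch below. The vectors and the `THRESHOLD` value are invented for illustration and do not come from the paper, and the sketch does not show how the paper crafts colliding inputs. It only shows that the cache's hit test cannot tell a benign paraphrase from an unrelated input whose embedding happens to fall inside the same neighborhood.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


THRESHOLD = 0.95  # hypothetical hit threshold

# All vectors below are made-up 3-dimensional "embeddings".
cached_key = (0.80, 0.60, 0.00)         # key of an existing cache entry
benign_paraphrase = (0.78, 0.62, 0.05)  # same question, reworded
crafted_input = (0.75, 0.60, 0.20)      # different text that lands nearby
unrelated_input = (0.00, 0.10, 0.99)    # different text that lands far away


def is_hit(query_vec):
    """The cache's only test: is the query close enough to the key?"""
    return cosine(query_vec, cached_key) >= THRESHOLD


for name, vec in [
    ("benign_paraphrase", benign_paraphrase),
    ("crafted_input", crafted_input),
    ("unrelated_input", unrelated_input),
]:
    print(name, round(cosine(vec, cached_key), 3), is_hit(vec))
```

Both the paraphrase and the crafted input pass the threshold, while the distant vector does not. From the cache's point of view the first two are indistinguishable.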
Why This Matters
For AI practitioners, this vulnerability strikes at a core infrastructure component. Semantic caching isn't a niche feature; the paper describes it as a pivotal technique for scaling LLM applications, widely adopted by major providers including AWS and Microsoft.
The implications are threefold:
- Security boundaries blur – Caching systems that were designed purely for performance optimization now become attack surfaces. Organizations using shared caching infrastructure (common in multi-tenant deployments) face heightened risk.
- Cost vs. security tradeoff – Disabling semantic caching or adding strict input validation could negate its performance benefits. Practitioners must weigh the cost savings against the added risk and the overhead of mitigating it.
- Monitoring gaps – Most LLM monitoring tools focus on prompt injection or output toxicity, not cache behavior. This attack vector requires new detection mechanisms.
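One possible shape for such a detection mechanism is a per-client sliding window over cache hits and misses. The sketch below is an assumption-laden starting point, not a method from the paper: `HitRatioMonitor`, its `baseline`, `tolerance`, and `min_samples` parameters, and their default values are all hypothetical and would need tuning against real traffic.

```python
from collections import defaultdict, deque


class HitRatioMonitor:
    """Hypothetical monitor that flags clients whose cache hit ratio
    drifts far from an expected baseline over a sliding window."""

    def __init__(self, window=100, baseline=0.30, tolerance=0.40, min_samples=20):
        self.baseline = baseline        # expected hit ratio (assumed value)
        self.tolerance = tolerance      # allowed absolute deviation
        self.min_samples = min_samples  # avoid flagging on too little data
        # One bounded queue of 1 (hit) / 0 (miss) events per client.
        self.events = defaultdict(lambda: deque(maxlen=window))

    def record(self, client_id, hit):
        self.events[client_id].append(1 if hit else 0)

    def hit_ratio(self, client_id):
        events = self.events[client_id]
        return sum(events) / len(events) if events else 0.0

    def is_anomalous(self, client_id):
        if len(self.events[client_id]) < self.min_samples:
            return False
        return abs(self.hit_ratio(client_id) - self.baseline) > self.tolerance


monitor = HitRatioMonitor()
for _ in range(30):
    monitor.record("client-a", hit=True)  # every request hits the cache
print(monitor.hit_ratio("client-a"), monitor.is_anomalous("client-a"))  # 1.0 True
```

A client that hits the cache on every request, or on none, is not necessarily malicious, so a flag like this is better treated as a signal for review than as grounds for blocking.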
What Practitioners Should Consider
For teams deploying LLM applications, this research suggests several immediate actions:
- Review whether your caching layer supports tenant isolation or uses shared keys
- Rate-limit bursts of near-duplicate queries from a single client, and log them, since repeated probing around one region of embedding space may indicate collision attempts
- Consider using authenticated cache keys that incorporate user or session identifiers
- Monitor cache hit ratios for anomalies that might indicate systematic collision attacks
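The tenant-isolation and authenticated-key suggestions above can be combined by partitioning the cache per tenant, so that a similarity lookup only ever searches entries written by the same tenant. The following is a minimal sketch under stated assumptions: `TenantScopedCache`, `partition_id`, and `SERVER_SECRET` are hypothetical names, embeddings are assumed to be computed elsewhere and passed in as tuples, and the HMAC step is just one way to derive a partition label that clients cannot forge.

```python
import hashlib
import hmac
import math

SERVER_SECRET = b"replace-with-a-real-secret"  # hypothetical; load from a secret store


def partition_id(tenant_id):
    """Derive an unforgeable partition label from the authenticated tenant ID."""
    return hmac.new(SERVER_SECRET, tenant_id.encode("utf-8"), hashlib.sha256).hexdigest()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TenantScopedCache:
    """Hypothetical semantic cache with one isolated partition per tenant."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.partitions = {}  # partition label -> list of (embedding, response)

    def put(self, tenant_id, embedding, response):
        self.partitions.setdefault(partition_id(tenant_id), []).append((embedding, response))

    def get(self, tenant_id, embedding):
        """Search only the caller's own partition; return None on a miss."""
        entries = self.partitions.get(partition_id(tenant_id), [])
        best = max(entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None


cache = TenantScopedCache()
cache.put("tenant-a", (1.0, 0.0), "response for tenant A")
print(cache.get("tenant-a", (0.99, 0.05)))  # response for tenant A
print(cache.get("tenant-b", (0.99, 0.05)))  # None
```

Partitioning limits cross-tenant leakage and poisoning, but it does not prevent collisions within a single tenant's partition, and it lowers the hit rate because tenants no longer share entries.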
Key Takeaways
- Semantic caching systems using embedding vectors as keys are vulnerable to collision attacks that can leak data or poison responses
- Major providers (AWS, Microsoft) using these systems face a new attack vector that requires infrastructure-level mitigation
- AI practitioners must add cache security to their threat models, alongside prompt injection and output validation
- The performance benefits of semantic caching now come with a security cost that must be actively managed