Research 2026-07-01

From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching

Originally published by Arxiv CS.AI

arXiv:2601.23088v2. Abstract: Semantic caching has emerged as a pivotal technique for scaling LLM applications, widely adopted by major providers including AWS and Microsoft. By utilizing semantic embedding vectors as cache keys, this mechanism effectively minimizes...

The Hidden Cost of Efficiency: Semantic Caching Under Attack

A new paper on arXiv (arXiv:2601.23088v2) describes an underexplored vulnerability in semantic caching, a technique adopted by major providers including AWS and Microsoft. The attack, termed a "Key Collision Attack," exploits the very mechanism that makes semantic caching efficient: using embedding vectors as cache keys rather than exact string matches.

Semantic caching stores LLM responses keyed by an embedding vector of the query rather than by its literal text. When one user asks "What's the capital of France?" and another asks "Name the capital city of France," the two embeddings land close together, so the system treats the second query as a match (typically when a similarity score clears a configured threshold) and serves the cached response. This reduces latency and compute cost for frequently repeated queries.
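
The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not any provider's implementation and not code from the paper: the SemanticCache class, the 0.9 threshold, and the toy_embed function (a keyword counter standing in for a real embedding model) are all assumptions made for the example.

```python
import math
import re
from typing import Callable, List, Optional, Tuple

Vector = List[float]

# Hypothetical stand-in for an embedding model: counts a few keywords.
VOCAB = ("capital", "france", "weather", "berlin")


def toy_embed(text: str) -> Vector:
    tokens = re.findall(r"[a-z]+", text.lower())
    return [float(tokens.count(word)) for word in VOCAB]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector matches nothing
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Illustrative cache keyed by embedding vectors (assumed design)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold  # assumed value; real systems tune this
        self.entries: List[Tuple[Vector, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def get(self, query: str) -> Optional[str]:
        # Linear scan for the most similar stored key.
        query_vec = self.embed(query)
        best_score, best_response = -1.0, None
        for key_vec, response in self.entries:
            score = cosine_similarity(query_vec, key_vec)
            if score > best_score:
                best_score, best_response = score, response
        # A hit requires only that the vectors are close enough.
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(toy_embed)
cache.put("What's the capital of France?", "Paris")
paraphrase_result = cache.get("Name the capital city of France")
unrelated_result = cache.get("What's the weather in Berlin?")
```

In this toy setup the paraphrase hits the cache and the unrelated question misses. The lookup checks only vector proximity, never whether the two texts ask the same thing, which is the property a key collision would rely on.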

What the Research Reveals

The paper demonstrates that attackers can craft inputs that collide with cached entries from unrelated queries, effectively poisoning the cache or extracting sensitive information. By generating inputs whose embedding vectors map close to cached entries for different content, an attacker can:

Retrieve responses meant for other users (data leakage)
Corrupt the cache with malicious responses that get served to legitimate users
Exhaust cache resources by forcing mass collisions

The attack is particularly insidious because semantic caching systems are designed to be transparent: users should not notice whether a response was cached or freshly generated. That same transparency makes detection difficult.

Why This Matters

For AI practitioners, this vulnerability strikes at a core infrastructure component. Semantic caching isn't a niche feature; it's the backbone of cost-effective LLM deployment at scale. AWS, Microsoft, and other providers have invested heavily in these systems to make LLM APIs economically viable.

The implications are threefold:

Security boundaries blur – Caching systems that were designed purely for performance optimization now become attack surfaces. Organizations using shared caching infrastructure (common in multi-tenant deployments) face heightened risk.

Cost vs. security tradeoff – Disabling semantic caching or implementing strict input validation could negate its performance benefits. Practitioners must now evaluate whether the cost savings justify the security overhead.

Monitoring gaps – Most LLM monitoring tools focus on prompt injection or output toxicity, not cache behavior. This attack vector requires new detection mechanisms.
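
One shape a cache-behavior check might take is a sliding-window hit-ratio monitor. The sketch below is an assumption for illustration only: the HitRatioMonitor class, the window size, the baseline, and the tolerance are hypothetical and would need tuning against real traffic. It is not a mechanism described in the paper.

```python
from collections import deque


class HitRatioMonitor:
    """Flags windows whose cache hit ratio drifts far from a baseline."""

    def __init__(self, window: int = 100, baseline: float = 0.3,
                 tolerance: float = 0.2):
        self.events = deque(maxlen=window)  # 1 = hit, 0 = miss
        self.baseline = baseline            # assumed normal hit ratio
        self.tolerance = tolerance          # assumed allowed deviation

    def record(self, hit: bool) -> None:
        self.events.append(1 if hit else 0)

    def ratio(self) -> float:
        if not self.events:
            return 0.0
        return sum(self.events) / len(self.events)

    def is_anomalous(self) -> bool:
        # Only judge once the window is full, to avoid noisy early alerts.
        if len(self.events) < self.events.maxlen:
            return False
        return abs(self.ratio() - self.baseline) > self.tolerance


monitor = HitRatioMonitor(window=10, baseline=0.3, tolerance=0.2)
for hit in [True] * 3 + [False] * 7:
    monitor.record(hit)
normal_window_flagged = monitor.is_anomalous()

for _ in range(10):
    monitor.record(True)  # a sudden run of hits replaces the window
burst_window_flagged = monitor.is_anomalous()
```

A deviation in either direction is flagged here because both an unusual surge of hits and an unusual drop could be worth a look. A single global ratio is a coarse signal, so a production version would likely track ratios per client or per tenant.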

What Practitioners Should Consider

For teams deploying LLM applications, this research suggests several immediate actions:

Review whether your caching layer supports tenant isolation or uses shared keys
Rate-limit bursts of semantically similar queries, and log them so that collision attempts can be detected
Consider using authenticated cache keys that incorporate user or session identifiers
Monitor cache hit ratios for anomalies that might indicate systematic collision attacks
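
The tenant-isolation and scoped-key items above can be sketched as a cache that partitions entries by tenant, so a lookup is only ever compared against keys written by the same tenant. This is a hedged illustration under assumptions: TenantScopedSemanticCache, letter_embed (a letter-frequency stand-in for a real embedding model), and the 0.9 threshold are hypothetical names and values, not the paper's proposal or any provider's API.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


def letter_embed(text: str) -> Vector:
    # Hypothetical stand-in for an embedding model: letter frequencies.
    lowered = text.lower()
    return [float(lowered.count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TenantScopedSemanticCache:
    """Semantic cache with one isolated partition per tenant (assumed design)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold
        self._partitions: Dict[str, List[Tuple[Vector, str]]] = {}

    def put(self, tenant_id: str, query: str, response: str) -> None:
        partition = self._partitions.setdefault(tenant_id, [])
        partition.append((self.embed(query), response))

    def get(self, tenant_id: str, query: str) -> Optional[str]:
        # Only this tenant's entries are candidates; other tenants' keys
        # are never compared, however close their embeddings are.
        query_vec = self.embed(query)
        best_score, best_response = -1.0, None
        for key_vec, response in self._partitions.get(tenant_id, []):
            score = cosine_similarity(query_vec, key_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


scoped = TenantScopedSemanticCache(letter_embed)
scoped.put("tenant-a", "reset my password", "answer for tenant A")
same_tenant = scoped.get("tenant-a", "reset my password")
other_tenant = scoped.get("tenant-b", "reset my password")
```

Partitioning trades cache hit rate for isolation: identical questions from different tenants no longer share an entry. It also only limits cross-tenant collisions; collisions between queries inside a single tenant's partition remain possible.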

The paper serves as a reminder that as LLM infrastructure matures, the attack surface expands beyond the model itself to include the supporting systems that make AI practical at scale.

Key Takeaways

Semantic caching systems using embedding vectors as keys are vulnerable to collision attacks that can leak data or poison responses
Major providers (AWS, Microsoft) using these systems face a new attack vector that requires infrastructure-level mitigation
AI practitioners must add cache security to their threat models, alongside prompt injection and output validation
The performance benefits of semantic caching now come with a security cost that must be actively managed
