BeClaude
Research · 2026-05-12

When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

Source: Arxiv CS.AI

arXiv:2605.08234v1 Announce Type: cross Abstract: Long-context LLM inference is bottlenecked by the memory and bandwidth cost of reading large KV caches during decoding. KV compression reduces this cost by keeping only part of the cache, but task accuracy alone does not identify why a selector...
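The abstract's core idea, keeping only part of the KV cache during decoding, can be illustrated with a toy value-aware eviction rule. This is a minimal sketch, not the paper's actual selector: the `evict_kv` helper and the value-norm scoring criterion are illustrative assumptions.

```python
import numpy as np

def evict_kv(keys, values, budget):
    """Keep the `budget` cached KV pairs with the largest value norms.

    A simple value-aware heuristic for illustration only; the paper's
    actual selection criterion is not shown in this snippet.
    """
    scores = np.linalg.norm(values, axis=-1)  # one score per cached token
    keep = np.sort(np.argsort(scores)[-budget:])  # top-`budget`, in token order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
T, d = 128, 16  # cache length, per-head dimension (arbitrary toy sizes)
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))

k_small, v_small = evict_kv(keys, values, budget=32)
print(k_small.shape, v_small.shape)  # (32, 16) (32, 16)
```

Reading 32 instead of 128 entries per head cuts the decode-time memory traffic proportionally, which is the bandwidth saving the abstract refers to; the open question the paper addresses is diagnosing *why* a given selector helps or hurts accuracy.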

Tags: arxivpapers