Research · 2026-05-06
Stochastic Sparse Attention for Memory-Bound Inference
Source: arXiv cs.AI
arXiv:2605.01910v1 (Announce Type: cross)

Abstract: Autoregressive decoding becomes bandwidth-limited at long contexts, since generating each token requires reading all $n_k$ key and value vectors from the KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies...
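The abstract is truncated, so SANTA's actual sparsification rule is not shown here. As background for the bandwidth argument, a minimal sketch of generic top-$k$ sparse attention (not the paper's method) illustrates how reading only $k \ll n_k$ cached key/value rows per decoded token reduces memory traffic; all names below are illustrative:

```python
import math

def topk_sparse_attention(q, K, V, k=4):
    """Generic top-k sparse attention sketch (not SANTA itself):
    attend only to the k highest-scoring cached keys, so a decode
    step reads k rows of K/V instead of all n_k."""
    d = len(q)
    # Scaled dot-product score of the query against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in K]
    # Keep the k highest-scoring positions; the other K/V rows are never read again.
    kept = sorted(range(len(K)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax over the kept scores only.
    m = max(scores[i] for i in kept)
    w = [math.exp(scores[i] - m) for i in kept]
    z = sum(w)
    # Weighted sum of the kept value vectors.
    out = [0.0] * len(V[0])
    for wi, i in zip(w, kept):
        for j, vj in enumerate(V[i]):
            out[j] += (wi / z) * vj
    return out

# Tiny example: 4 cached positions, 2-dim heads, keep k=2.
q = [1.0, 0.0]
K = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 1.0], [2.0, 0.0], [3.0, 3.0], [4.0, 4.0]]
print(topk_sparse_attention(q, K, V, k=2))
```

With $k=2$ here, only the two keys most aligned with the query contribute, and the output is a convex combination of their value vectors.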