Research · 2026-05-06
Stochastic Sparse Attention for Memory-Bound Inference
Source: arXiv cs.AI
arXiv:2605.01910v1 (Announce Type: cross)

Abstract: Autoregressive decoding becomes bandwidth-limited at long contexts, since generating each token requires reading all $n_k$ key and value vectors from the KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies...
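The abstract is truncated, so SANTA's actual sparsification rule is not shown here. As background for the bandwidth argument, a minimal sketch of generic top-$k$ sparse attention (not the paper's method) illustrates how reading only $k \ll n_k$ cached key/value rows per decoded token reduces memory traffic; all names below are illustrative:

```python
import math

def topk_sparse_attention(q, K, V, k=4):
    """Generic top-k sparse attention sketch (not SANTA itself):
    attend only to the k highest-scoring cached keys, so a decode
    step reads k rows of K/V instead of all n_k."""
    d = len(q)
    # Scaled dot-product score of the query against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in K]
    # Keep the k highest-scoring positions; the other K/V rows are never read again.
    kept = sorted(range(len(K)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax over the kept scores only.
    m = max(scores[i] for i in kept)
    w = [math.exp(scores[i] - m) for i in kept]
    z = sum(w)
    # Weighted sum of the kept value vectors.
    out = [0.0] * len(V[0])
    for wi, i in zip(w, kept):
        for j, vj in enumerate(V[i]):
            out[j] += (wi / z) * vj
    return out

# Tiny example: 4 cached positions, 2-dim heads, keep k=2.
q = [1.0, 0.0]
K = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 1.0], [2.0, 0.0], [3.0, 3.0], [4.0, 4.0]]
print(topk_sparse_attention(q, K, V, k=2))
```

With $k=2$ here, only the two keys most aligned with the query contribute, and the output is a convex combination of their value vectors.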