BeClaude
Research 2026-05-11

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

Source: arXiv cs.AI

arXiv:2605.07719v1 (Announce Type: cross)

Abstract: Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention...
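To make the block-sparse setting concrete, here is a minimal sketch of attention that reads only a few blocks of a blocked KV cache rather than the whole cache. This is an illustrative toy, not the paper's method: the block-selection heuristic (scoring each block by the query's dot product with the block's mean key) and all names (`block_sparse_attention`, `block_size`, `top_k`) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, K, V, block_size=4, top_k=2):
    """Attend from one query to only the top-k KV blocks.

    q: (d,) query vector.
    K, V: (n, d) KV cache (n divisible by block_size), e.g. host-resident.
    Blocks are ranked by q . mean(keys in block) -- one common heuristic,
    not necessarily the selection criterion used in the paper.
    """
    n, d = K.shape
    nb = n // block_size
    Kb = K.reshape(nb, block_size, d)
    Vb = V.reshape(nb, block_size, d)
    # Cheap per-block importance score from the block's mean key.
    scores = Kb.mean(axis=1) @ q                    # shape (nb,)
    sel = np.argsort(scores)[-top_k:]               # top-k block indices
    Ks = Kb[sel].reshape(-1, d)                     # gather only selected keys
    Vs = Vb[sel].reshape(-1, d)                     # and their values
    # Dense attention restricted to the selected blocks.
    w = softmax(Ks @ q / np.sqrt(d))
    return w @ Vs
```

In a real system the gather of selected blocks is the expensive step (a host-to-device copy when the cache lives in CPU memory), which is what motivates overlapping it with GPU compute.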

arxivpapers