Research | 2026-05-08
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
Source: arXiv cs.AI
arXiv:2605.06105v1 (Announce Type: new)
Abstract: Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce Shallow Prefill, dEEp Decode...
arxiv, papers