Research · 2026-04-20
Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
Source: arXiv cs.AI
arXiv:2604.15464v1 (Announce Type: cross)

Abstract: Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google's Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and...