LLM Serving Optimization with Variable Prefill and Decode Lengths
arXiv:2508.06133v4. Abstract: We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths. Prompt tokens determine initial KV-cache usage, while...
The Hidden Bottleneck in LLM Inference
A new arXiv paper tackles a practical but often overlooked challenge in large language model serving: how to optimally schedule requests when both the input (prefill) and output (decode) lengths vary dramatically. The researchers propose an offline scheduling framework that operates under a fixed KV-cache memory budget, addressing the fundamental tension between memory allocation and throughput in production LLM systems.
The core problem is that KV-cache memory consumption is not uniform. Cache size grows linearly with the number of tokens held, so a request with a 10,000-token prompt starts out occupying roughly 100 times the cache of one with a 100-token prompt, and the decode phase then grows each request's footprint with every generated token. When memory is capped, as it always is in real deployments, naive scheduling leads to either memory exhaustion or severe underutilization.
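To make the arithmetic concrete, here is a minimal sketch of a per-request KV-cache footprint. The model dimensions in `ModelShape` are hypothetical values chosen for illustration, not taken from the paper, and the formula assumes a standard attention cache that stores one key and one value vector per token, per layer, per KV head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions, chosen only for illustration."""
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # fp16 / bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # One key vector and one value vector per layer and per KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_value)


def kv_bytes(shape: ModelShape, prompt_len: int, generated: int) -> int:
    # The cache holds every prompt token plus every token generated so far.
    return (prompt_len + generated) * kv_bytes_per_token(shape)


shape = ModelShape()
per_token = kv_bytes_per_token(shape)
long_prompt = kv_bytes(shape, prompt_len=10_000, generated=0)
short_prompt = kv_bytes(shape, prompt_len=100, generated=0)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"long prompt:  {long_prompt / 2**20:.1f} MiB")
print(f"short prompt: {short_prompt / 2**20:.1f} MiB")
print(f"ratio: {long_prompt // short_prompt}x")
```

Under these assumed dimensions each token costs 128 KiB, so the gap between the two prompts is about 1.2 GiB before a single output token has been generated.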
Why This Matters Now
This research arrives at a critical inflection point. As LLMs move from demo to production, serving efficiency directly impacts cost and latency. Most existing optimization work focuses on either prefill or decode in isolation, or assumes uniform request profiles. Real-world traffic, however, is deeply heterogeneous: a code generation request might have a long prompt but short output, while a creative writing task might be the reverse.
The paper’s key insight is that scheduling decisions must account for both phases simultaneously. By treating KV-cache as a finite resource that must be dynamically partitioned across concurrent requests, the authors demonstrate significant improvements in throughput and memory utilization compared to static allocation strategies.
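A small simulation shows why both phases have to be considered together. This is an illustrative sketch, not the paper's algorithm: it assumes all requests in a batch start at the same time, that each running request generates one token per step, and that a request releases its whole cache right after producing its last token. Memory is counted in tokens.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (prompt_len, decode_len), both in tokens


def usage_at(requests: List[Request], step: int) -> int:
    """KV-cache tokens held after `step` decode steps."""
    total = 0
    for prompt_len, decode_len in requests:
        if step <= decode_len:
            # Still holding cache: the prompt plus tokens generated so far.
            total += prompt_len + step
        # Otherwise the request has finished and released its cache.
    return total


def peak_usage(requests: List[Request]) -> int:
    """Largest KV-cache occupancy over the lifetime of the batch."""
    horizon = max(decode_len for _, decode_len in requests)
    return max(usage_at(requests, t) for t in range(horizon + 1))


# Long prompt with a short answer, and a short prompt with a long answer.
batch = [(1000, 10), (100, 500)]

prompt_only = sum(p for p, _ in batch)       # ignores decode growth
sum_of_peaks = sum(p + d for p, d in batch)  # assumes peaks coincide
print(prompt_only, peak_usage(batch), sum_of_peaks)
```

For this batch the prompt-only count (1100) underestimates the true peak (1120), while summing each request's individual peak (1610) overestimates it, because the long-prompt request has already freed its cache by the time the long-decode request reaches full size.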
Implications for AI Practitioners
For teams deploying LLMs at scale, this work highlights several actionable points:
First, memory budgeting for inference should not be a one-size-fits-all calculation. Practitioners need to profile their actual request distributions (prompt lengths, decode lengths, and their correlation) to set appropriate cache limits. A model serving chatbot queries with short prompts and long responses requires a different memory strategy than one handling document analysis with lengthy inputs.
Second, the offline scheduling approach suggests that pre-computing optimal request batching can yield substantial gains. This is particularly relevant for batch inference pipelines, where latency constraints are looser and throughput is paramount. The paper’s framework could be integrated into existing serving systems like vLLM or TensorRT-LLM to improve their batching heuristics.
Third, the fixed memory budget assumption mirrors real-world constraints. Practitioners should consider implementing admission control or request queuing policies that account for both prefill and decode memory footprints, rather than relying on simple token count limits.
The research also underscores a broader trend: as LLM serving matures, the low-hanging fruit of model-level optimizations (quantization, pruning) is being exhausted, and system-level scheduling innovations are becoming the next frontier for efficiency gains.
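The third point, admission control that accounts for both prefill and decode footprints, can be sketched as a simple FIFO admitter. This is a deliberately conservative illustration and not the scheduler proposed in the paper: the class name `MemoryAwareAdmitter`, the `max_new_tokens` field, and the policy of reserving each request's worst-case footprint up front are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    req_id: str
    prompt_len: int
    max_new_tokens: int  # upper bound on decode length

    @property
    def peak_tokens(self) -> int:
        # Worst-case KV-cache footprint: full prompt plus full response.
        return self.prompt_len + self.max_new_tokens


class MemoryAwareAdmitter:
    """FIFO admission that reserves each request's peak footprint."""

    def __init__(self, budget_tokens: int) -> None:
        self.budget_tokens = budget_tokens
        self.reserved = 0
        self.waiting: Deque[Request] = deque()
        self.running: List[Request] = []

    def submit(self, req: Request) -> None:
        if req.peak_tokens > self.budget_tokens:
            raise ValueError(f"{req.req_id} can never fit in the budget")
        self.waiting.append(req)

    def admit(self) -> List[Request]:
        """Admit queued requests, in order, while their peaks still fit."""
        admitted = []
        while (self.waiting and
               self.reserved + self.waiting[0].peak_tokens <= self.budget_tokens):
            req = self.waiting.popleft()
            self.reserved += req.peak_tokens
            self.running.append(req)
            admitted.append(req)
        return admitted

    def finish(self, req: Request) -> None:
        """Release the reservation of a completed request."""
        self.running.remove(req)
        self.reserved -= req.peak_tokens


admitter = MemoryAwareAdmitter(budget_tokens=4000)
doc = Request("doc-analysis", prompt_len=3000, max_new_tokens=200)
chat = Request("chat", prompt_len=100, max_new_tokens=1500)
short = Request("short", prompt_len=200, max_new_tokens=300)
for r in (doc, chat, short):
    admitter.submit(r)

first = admitter.admit()   # only the document request fits at first
admitter.finish(doc)
second = admitter.admit()  # the remaining two now fit together
print([r.req_id for r in first], [r.req_id for r in second])
```

A limit based on prompt tokens alone would have admitted the chat request alongside the document request (3100 prompt tokens against a 4000-token budget) and then run out of memory partway through decoding. The cost of this sketch's simplicity is head-of-line blocking and idle memory, since it reserves peaks that do not all occur at the same time; closing that gap is exactly where smarter scheduling has room to help.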
Key Takeaways
- Heterogeneous request profiles (varying prefill and decode lengths) create significant inefficiencies in current LLM serving systems that assume uniform memory usage.
- KV-cache memory budgeting must jointly consider both input and output phases to maximize throughput under fixed memory constraints.
- Offline scheduling optimization offers a practical path to improve batch inference efficiency, particularly for production pipelines with known request distributions.
- Practitioners should audit their request patterns and consider implementing memory-aware scheduling policies rather than relying on static or naive batching heuristics.
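As a starting point for the audit suggested in the last takeaway, the sketch below summarizes a request log by prompt length, decode length, and their correlation. The log entries are made-up sample values and the `profile` helper is hypothetical; a real audit would read lengths from serving logs.

```python
import math
from statistics import mean, median
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def profile(log: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize (prompt_len, decode_len) pairs from a request log."""
    prompts = [p for p, _ in log]
    decodes = [d for _, d in log]
    return {
        "prompt_median": median(prompts),
        "decode_median": median(decodes),
        "peak_max": max(p + d for p, d in log),  # largest single footprint
        "prompt_decode_corr": pearson(prompts, decodes),
    }


# Made-up sample: (prompt_len, decode_len) in tokens.
sample_log = [(100, 800), (200, 600), (4000, 50), (8000, 20), (300, 700)]
stats = profile(sample_log)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

A strongly negative correlation, as in this sample, indicates traffic that mixes long-prompt/short-answer and short-prompt/long-answer requests, which is the heterogeneous case where memory-aware scheduling matters most.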