Research · 2026-04-17
Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
Source: arXiv cs.AI
arXiv:2604.09613v2 (replace-cross)

Abstract: Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short, and simultaneously triggering KV-cache failures: OOM crashes, preemption storms, and request...
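The title suggests routing each request to a pool sized for its token budget rather than provisioning every instance for the worst case. A minimal sketch of that idea, assuming pool names, context limits, and the budget estimate (prompt tokens plus requested output tokens) as illustrative placeholders rather than the paper's actual design:

```python
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    max_context: int  # tokens this pool's instances are provisioned for

# Pools sorted by ascending context limit (hypothetical sizes).
POOLS = [
    Pool("short", 4_096),
    Pool("medium", 16_384),
    Pool("long", 131_072),
]

def route(prompt_tokens: int, max_new_tokens: int) -> Pool:
    """Pick the smallest pool whose context limit covers the request's budget."""
    budget = prompt_tokens + max_new_tokens
    for pool in POOLS:
        if budget <= pool.max_context:
            return pool
    # Over-budget requests fall back to the largest pool here;
    # a real router might reject or queue them instead.
    return POOLS[-1]

print(route(1_000, 512).name)     # short request lands in the small pool
print(route(20_000, 2_000).name)  # long request lands in the large pool
```

Because most requests are short, the short pool can pack many more concurrent requests per GPU than a fleet uniformly provisioned for 131k-token contexts, which is the concurrency waste the abstract quantifies at 4-8x.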