BeClaude
Research · 2026-04-17

Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

Source: Arxiv CS.AI

arXiv:2604.09613v2 Announce Type: replace-cross Abstract: Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures -- OOM crashes, preemption storms, and request...
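The routing idea the abstract describes can be sketched minimally: provision separate pools for short and long contexts, and send each request to the smallest pool whose token budget covers its predicted length. The pool names, budget sizes, and endpoint lists below are illustrative assumptions, not the paper's actual configuration, and a real deployment would need a length predictor in front of this.

```python
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    max_tokens: int            # context-length budget this pool is provisioned for
    endpoints: list = field(default_factory=list)

# Hypothetical pool layout: most traffic is short, so the short pool
# runs at much higher concurrency per GPU than a worst-case provisioned one.
POOLS = [
    Pool("short", 4_096, ["vllm-short-0", "vllm-short-1"]),
    Pool("long", 32_768, ["vllm-long-0"]),
]

def route(predicted_tokens: int) -> Pool:
    """Route a request to the smallest pool whose budget covers it."""
    for pool in sorted(POOLS, key=lambda p: p.max_tokens):
        if predicted_tokens <= pool.max_tokens:
            return pool
    # Oversized requests fall back to the largest pool rather than
    # risking an OOM or preemption on a small one.
    return max(POOLS, key=lambda p: p.max_tokens)

print(route(512).name)      # short request stays in the cheap pool
print(route(20_000).name)   # long request goes to the big-context pool
```

Because short requests never land on worst-case-provisioned instances, the short pool's KV-cache headroom per request shrinks, which is where the concurrency recovery the abstract quantifies would come from.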
