BeClaude
Research · 2026-04-17

Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

Source: Arxiv CS.AI

arXiv:2604.09613v2 Announce Type: replace-cross Abstract: Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures -- OOM crashes, preemption storms, and request...
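The routing idea the abstract describes can be sketched minimally: provision separate pools for short and long contexts, and send each request to the smallest pool whose token budget covers its predicted length. The pool names, budget sizes, and endpoint lists below are illustrative assumptions, not the paper's actual configuration, and a real deployment would need a length predictor in front of this.

```python
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    max_tokens: int            # context-length budget this pool is provisioned for
    endpoints: list = field(default_factory=list)

# Hypothetical pool layout: most traffic is short, so the short pool
# runs at much higher concurrency per GPU than a worst-case provisioned one.
POOLS = [
    Pool("short", 4_096, ["vllm-short-0", "vllm-short-1"]),
    Pool("long", 32_768, ["vllm-long-0"]),
]

def route(predicted_tokens: int) -> Pool:
    """Route a request to the smallest pool whose budget covers it."""
    for pool in sorted(POOLS, key=lambda p: p.max_tokens):
        if predicted_tokens <= pool.max_tokens:
            return pool
    # Oversized requests fall back to the largest pool rather than
    # risking an OOM or preemption on a small one.
    return max(POOLS, key=lambda p: p.max_tokens)

print(route(512).name)      # short request stays in the cheap pool
print(route(20_000).name)   # long request goes to the big-context pool
```

Because short requests never land on worst-case-provisioned instances, the short pool's KV-cache headroom per request shrinks, which is where the concurrency recovery the abstract quantifies would come from.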
