Research, 2026-06-19

Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

arXiv:2606.19376v1 Announce Type: cross Abstract: Inference costs for large language model (LLM) applications are rapidly growing, driven by surging demand and rising infrastructure cost. Users expect high-quality responses, and in commercial settings this is formally codified in Service Level...

The escalating cost of deploying large language models (LLMs) is a pressing reality for enterprises. A new paper on arXiv (2606.19376v1) tackles this head-on by proposing a framework for cost-optimal LLM routing that operates under strict user satisfaction guarantees, even when feedback data is scarce. The core idea is deceptively simple but technically nuanced: instead of routing every query to the most powerful (and expensive) model, the system selects a cheaper model whenever that model is expected to satisfy the user, and reserves premium models for complex or high-stakes requests.

What Happened

The researchers formalize "LLM routing" as a cost-minimization problem constrained by a Service Level Agreement (SLA) on user satisfaction. The critical innovation is addressing the "cold start" problem, where a system has little to no user feedback on which models work for which queries. They introduce a method that balances exploration (trying cheaper models to learn their capabilities) with exploitation (using known-good models to ensure satisfaction). The framework provides theoretical guarantees that, over time, the system will converge to a near-optimal routing policy without violating the satisfaction threshold by more than an acceptable margin.
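To make the exploration/exploitation balance concrete, here is a minimal sketch of a conservative router. It is not the paper's algorithm: the class names, the Hoeffding-style confidence bound, the explore_prob parameter, and the treatment of the most expensive model as a trusted fallback are all illustrative assumptions. The router sends a query to the cheapest model whose lower confidence bound on satisfaction clears the target, and otherwise falls back to the most expensive model.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    cost: float          # hypothetical cost per query
    successes: int = 0   # feedback events marked "satisfied"
    trials: int = 0      # feedback events observed in total

    def lower_bound(self, delta: float = 0.05) -> float:
        """Hoeffding-style lower confidence bound on the satisfaction rate."""
        if self.trials == 0:
            return 0.0  # no evidence yet, so assume nothing
        mean = self.successes / self.trials
        margin = math.sqrt(math.log(1 / delta) / (2 * self.trials))
        return max(0.0, mean - margin)


class ConservativeRouter:
    def __init__(self, models, target, explore_prob=0.05, seed=0):
        self.models = sorted(models, key=lambda m: m.cost)  # cheapest first
        self.target = target              # required satisfaction rate, e.g. 0.95
        self.explore_prob = explore_prob  # share of traffic spent on exploration
        self.rng = random.Random(seed)

    def choose(self) -> ModelStats:
        # Exploration: occasionally try a random model to gather feedback.
        if self.rng.random() < self.explore_prob:
            return self.rng.choice(self.models)
        # Exploitation: cheapest model that is confidently good enough.
        for model in self.models:
            if model.lower_bound() >= self.target:
                return model
        # Assumption: the most expensive model is the trusted fallback.
        return self.models[-1]

    def update(self, model: ModelStats, satisfied: bool) -> None:
        model.trials += 1
        model.successes += int(satisfied)
```

With no feedback, every lower bound is zero, so all non-exploration traffic goes to the fallback; the cheap model only takes over once enough positive feedback has accumulated. The exploration share is what bounds how often the system risks an unsatisfying answer while it learns.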

Why It Matters

This research directly attacks the economic bottleneck of LLM deployment. Currently, many organizations default to using a single, powerful model (e.g., GPT-4 or Claude 3.5 Sonnet) for all traffic, incurring high per-query costs. This paper offers a path to a tiered architecture where a mix of smaller, faster, and cheaper models handle the bulk of routine queries—such as simple summarization, translation, or FAQ responses—while expensive frontier models are reserved for complex reasoning, code generation, or creative writing.

The "limited user feedback" aspect is particularly relevant. In production, explicit thumbs-up/thumbs-down data is sparse and noisy. This framework can operate with implicit signals (e.g., a user rephrasing a query or abandoning a session) or a small set of labeled examples, making it practical for real-world deployment where perfect data is unavailable.
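As a rough illustration of how sparse explicit feedback and implicit signals might be folded into a single satisfaction label, consider the sketch below. The SessionEvents fields and the labeling rule are hypothetical and are not taken from the paper; a real system would need to validate how well such signals track actual satisfaction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionEvents:
    thumbs: Optional[bool] = None  # explicit rating, if the user gave one
    rephrased: bool = False        # user re-asked the same question
    abandoned: bool = False        # user left right after the response


def infer_satisfaction(events: SessionEvents) -> Optional[bool]:
    """Return True/False when there is evidence, or None to abstain."""
    if events.thumbs is not None:
        return events.thumbs       # explicit feedback wins when present
    if events.rephrased or events.abandoned:
        return False               # assumed to indicate dissatisfaction
    return None                    # silence is not counted as success
```

Returning None for silent sessions keeps the router from treating a lack of complaints as proof of quality, which is one reason usable feedback stays limited in practice.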

Implications for AI Practitioners

Architecture Shift: Engineers should plan for multi-model routing layers, not single-model endpoints. This paper provides the mathematical backbone for building such routers that are provably safe (satisfaction guarantees) rather than heuristic-based.

Cost vs. Quality Trade-off Becomes Programmable: Instead of manual A/B testing to decide which model to use, practitioners can now define a satisfaction threshold (e.g., 95% user satisfaction) and let the routing algorithm minimize cost automatically. This turns a subjective judgment call into an optimization problem.

Data Strategy Changes: The focus shifts from collecting massive preference datasets to designing efficient exploration policies. A small, high-quality seed set of "hard" queries (where cheap models fail) becomes more valuable than a large, noisy dataset.

Operational Complexity Increases: Implementing this requires a robust feedback loop, a model performance matrix, and a fallback mechanism. It adds latency for the routing decision itself, though this is often negligible compared to LLM inference time.
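The "programmable trade-off" idea can be illustrated with a simplified two-model case. Assume, purely for illustration, that satisfaction rates and per-query costs are already known and that queries are split at random between a cheap and a premium model. The cheapest split that still meets the target then has a closed form. The function name and all numbers below are hypothetical.

```python
def cheapest_mix(s_cheap, s_premium, c_cheap, c_premium, target):
    """Return (fraction routed to the cheap model, expected cost per query).

    Assumes c_cheap < c_premium and a random split that ignores query content.
    """
    if s_premium < target:
        raise ValueError("target is unreachable even with the premium model")
    if s_cheap >= target:
        p = 1.0  # the cheap model alone already meets the target
    else:
        # Solve p * s_cheap + (1 - p) * s_premium = target for p.
        p = (s_premium - target) / (s_premium - s_cheap)
    cost = p * c_cheap + (1 - p) * c_premium
    return p, cost


# Hypothetical numbers: 90% vs 99% satisfaction, costs of 1 vs 10 units.
share, cost = cheapest_mix(0.90, 0.99, 1.0, 10.0, target=0.95)
print(f"cheap share = {share:.3f}, expected cost = {cost:.2f}")
```

With these numbers about 44% of traffic can go to the cheap model, cutting the expected cost from 10 to roughly 6 units per query. A router that looks at the query itself can usually do better than a blind random split, since it can send the cheap model only the queries it is likely to handle well.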

Key Takeaways

Cost-Optimal Routing is Now Provable: The paper offers a theoretical framework to minimize LLM inference costs while guaranteeing a user satisfaction threshold, even with limited initial feedback.
Tiered Model Architecture is the Future: The default "one model for everything" approach is economically unsustainable; intelligent routing between cheap and expensive models is a critical infrastructure component.
Cold-Start Problem Addressed: The framework explicitly handles the challenge of learning which model works best for which query type when user feedback is scarce, a common real-world pain point.
Actionable for Practitioners: This provides a blueprint for building cost-aware LLM gateways, moving beyond simple load balancing to quality-constrained cost optimization.

