Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds
arXiv:2606.29654v1. Abstract: Multi-agent deliberation among LLMs can improve reasoning, but deployment requires deciding when the current answer is reliable enough to act on and when it should be escalated to human review. We formulate this as budgeted act-or-defer decision...
This paper tackles a practical bottleneck in deploying multi-agent LLM systems: the cost and latency of deliberation that has no principled stopping point. While the field has focused on improving accuracy through more agents and more rounds of debate, this work introduces a formal framework for knowing when to stop. The core contribution is a “budgeted act-or-defer” mechanism that uses local reliability bounds to decide whether a consensus answer is trustworthy enough to act on or should be escalated to a human reviewer.
What Happened
The researchers formalize multi-agent LLM deliberation as a decision problem under a fixed budget (e.g., token cost, API calls, or time). Instead of running a fixed number of debate rounds, the system dynamically evaluates the reliability of the current consensus. It computes a “local reliability bound”: an estimate of how likely the current answer is to be correct, given the observed patterns of agreement and disagreement among agents. If this bound exceeds a threshold, the system “acts” (outputs the answer). If not, it either continues deliberating while budget remains or “defers” to a human. This creates a principled trade-off: spend budget on more agents or rounds only when uncertainty is high, and escalate to humans only when the system cannot achieve sufficient confidence within budget.
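The act, continue, or defer loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: the summary does not say how the local reliability bound is computed, so the sketch swaps in the Wilson score lower bound on the majority-vote share as a stand-in. The names `act_or_defer`, `ask_agent`, `cost_per_call`, and `min_votes` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1.0 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def act_or_defer(
    ask_agent: Callable[[int], str],
    threshold: float,
    budget: float,
    cost_per_call: float = 1.0,
    min_votes: int = 3,
) -> Tuple[str, Optional[str], float, float]:
    """Query agents one at a time until the bound clears the threshold
    ("act") or the next call would exceed the budget ("defer").

    ASSUMPTION: the Wilson bound on the majority share is only a proxy
    for reliability; it is not the bound defined in the paper.
    Returns (decision, majority_answer, bound, spent).
    """
    votes: List[str] = []
    spent = 0.0
    answer: Optional[str] = None
    bound = 0.0
    while spent + cost_per_call <= budget:
        votes.append(ask_agent(len(votes)))  # one more agent opinion
        spent += cost_per_call
        answer, count = Counter(votes).most_common(1)[0]
        bound = wilson_lower_bound(count, len(votes))
        if len(votes) >= min_votes and bound >= threshold:
            return "act", answer, bound, spent
    # Budget exhausted without reaching the threshold: escalate to a human.
    return "defer", answer, bound, spent


if __name__ == "__main__":
    # Unanimous agents reach the threshold; evenly split agents never do.
    print(act_or_defer(lambda i: "A", threshold=0.6, budget=10))
    print(act_or_defer(lambda i: "AB"[i % 2], threshold=0.6, budget=10))
```

Returning the bound and the amount spent alongside the decision keeps every outcome inspectable, whichever branch was taken.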
Why It Matters
This addresses a critical gap in current multi-agent LLM architectures. Today, most systems either run a fixed number of rounds (wasting budget on easy problems) or use arbitrary stopping criteria like “agreement among N agents” without statistical rigor. The act-or-defer framing is particularly important for high-stakes applications such as medical diagnosis, legal analysis, and financial auditing, where confidently wrong answers from overconfident LLMs are unacceptable. By providing a formal reliability bound, the system can enforce a minimum estimated confidence level before acting, and explicitly budget for human review when that confidence cannot be met.
For AI practitioners, this shifts the conversation from “how many agents do we need?” to “how much reliability do we need for this specific decision?” The framework allows teams to tune a single threshold parameter that controls the cost-reliability frontier, rather than manually engineering deliberation protocols.
Implications for AI Practitioners
First, this enables cost-aware deployment. Practitioners can now set a budget (e.g., $0.10 per query) and let the system automatically allocate that budget across agent calls and potential human escalation. Second, it provides a principled audit trail: every decision comes with a recorded reliability bound, making it easier to debug failures and justify system behavior to regulators. Third, the local reliability bounds approach is computationally lightweight compared to full Bayesian inference, making it feasible for real-time applications.
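To make the budgeting point concrete, the expected cost of a query can be written as the average agent spend plus the deferral rate times the cost of a human review. The sketch below is back-of-the-envelope accounting, not a procedure from the paper. All figures in the usage example (four agent calls on average, $0.01 per call, a 2% deferral rate, $2.00 per human review) are hypothetical, as is the function name.

```python
def expected_cost_per_query(
    mean_agent_calls: float,
    cost_per_agent_call: float,
    defer_rate: float,
    cost_per_human_review: float,
) -> float:
    """Expected spend per query: agent calls plus the share of queries
    that end up escalated to a human reviewer."""
    if not 0.0 <= defer_rate <= 1.0:
        raise ValueError("defer_rate must be a probability in [0, 1]")
    agent_cost = mean_agent_calls * cost_per_agent_call
    human_cost = defer_rate * cost_per_human_review
    return agent_cost + human_cost


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the arithmetic.
    cost = expected_cost_per_query(
        mean_agent_calls=4,
        cost_per_agent_call=0.01,
        defer_rate=0.02,
        cost_per_human_review=2.00,
    )
    budget = 0.10
    print(f"expected cost per query: ${cost:.2f}, within budget: {cost <= budget}")
```

With these illustrative numbers, rare human reviews cost as much in expectation as all the agent calls combined, so the deferral rate is worth tracking as closely as the token bill.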
The main limitation is that the reliability bounds depend on the quality of the underlying agent agreement model. If agents are systematically biased (e.g., all trained on similar data), their errors are correlated and agreement may not indicate correctness. Practitioners will need to calibrate these bounds against ground-truth data for their specific domain.
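One simple way to run such a calibration check is to log, for a labelled validation set, the bound each query received and whether the answer was correct, then measure accuracy among the queries that would have been acted on at each candidate threshold. The sketch below assumes that logging format; the function names and the toy records are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Optional, Tuple

Record = Tuple[float, bool]  # (reliability bound, answer was correct)


def empirical_accuracy_at(records: List[Record], threshold: float) -> Optional[float]:
    """Accuracy among queries whose bound meets the threshold,
    or None if no query would have been acted on."""
    acted = [correct for bound, correct in records if bound >= threshold]
    if not acted:
        return None
    return sum(acted) / len(acted)


def smallest_passing_threshold(
    records: List[Record],
    target_accuracy: float,
    candidates: Iterable[float],
) -> Optional[float]:
    """Lowest candidate threshold whose acted-on accuracy meets the target.

    Accuracy is not guaranteed to be monotone in the threshold, so higher
    candidates should still be inspected before deployment.
    """
    for t in sorted(candidates):
        acc = empirical_accuracy_at(records, t)
        if acc is not None and acc >= target_accuracy:
            return t
    return None


if __name__ == "__main__":
    # Toy validation log, made up for illustration.
    log = [(0.55, False), (0.60, True), (0.65, False),
           (0.70, True), (0.80, True), (0.90, True)]
    for t in (0.5, 0.6, 0.7, 0.8):
        print(t, empirical_accuracy_at(log, t))
    print(smallest_passing_threshold(log, 0.9, [0.5, 0.6, 0.7, 0.8]))
```

If accuracy among acted-on queries falls well below the nominal bound, that is a sign that agreement is overstating correctness in this domain, for example because the agents share the same blind spots.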
Key Takeaways
- Dynamic stopping reduces cost: The act-or-defer framework eliminates wasteful deliberation on easy problems by stopping as soon as a reliability bound is met.
- Formalizes the human-in-the-loop decision: Instead of ad-hoc escalation rules, the system mathematically determines when human review is necessary based on budget and confidence thresholds.
- Enables cost-reliability tuning: A single threshold parameter controls the trade-off between acting cheaply and deferring to expensive human review, allowing teams to optimize for their specific budget and accuracy requirements.
- Requires domain-specific calibration: The reliability bounds are only as good as the underlying model of agent agreement, necessitating careful validation against real-world data before deployment in production.