BeClaude
Research 2026-05-01

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

Source: arXiv cs.AI

arXiv:2604.09408v3

Abstract: Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help....

Tags: arxiv, papers, agents, benchmark