BRIDGE: Predicting Human Task Completion Time From Model Performance
arXiv:2602.07267v2. Abstract: Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and...
What Happened
A new paper on arXiv (2602.07267v2) introduces BRIDGE, a framework that predicts how long humans take to complete tasks from how AI models perform on those tasks. Rather than relying on expensive and noisy human annotation of task completion times, BRIDGE uses the relationship between model accuracy, confidence, and other performance metrics to estimate human task difficulty. The core insight is that if an AI system struggles with a task (low confidence, high variance, or frequent errors), that task is likely to take humans longer as well.
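To make the idea concrete, here is a minimal sketch of mapping model performance to human time. It is not the paper's method: it assumes a small calibration set of tasks with known human completion times, summarizes model difficulty as the log-odds of failure, and fits a straight line to log-minutes. The calibration numbers, the `difficulty` transform, and the function names are all hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def difficulty(success_rate, eps=1e-3):
    """Log-odds of model failure, clipped so rates of 0 or 1 stay finite."""
    p = min(max(success_rate, eps), 1 - eps)
    return math.log((1 - p) / p)

# Hypothetical calibration data: (mean model success rate, human minutes).
calibration = [(0.95, 2.0), (0.80, 8.0), (0.50, 30.0), (0.20, 120.0)]

xs = [difficulty(p) for p, _ in calibration]
ys = [math.log(minutes) for _, minutes in calibration]
a, b = fit_line(xs, ys)

def predict_minutes(success_rate):
    """Estimate human minutes for a task from its model success rate."""
    return math.exp(a + b * difficulty(success_rate))

if __name__ == "__main__":
    for rate in (0.9, 0.6, 0.3):
        print(f"success rate {rate:.1f}: about {predict_minutes(rate):.1f} min")
```

Fitting in log-minutes keeps predictions positive and treats a doubling of time the same at any scale. Whether a linear relationship holds at all is an assumption to check on real data.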
Why It Matters
This research addresses a persistent blind spot in AI evaluation. Current benchmarks measure model accuracy, but accuracy alone says little about whether a system is genuinely useful in practice. A model that achieves 95% accuracy on easy tasks but fails catastrophically on hard ones may appear strong, while a model with 90% accuracy that handles difficult cases gracefully might be more valuable. BRIDGE provides a way to translate model performance into a human-relevant metric, time, without the logistical burden of running large-scale human studies.
The implications extend beyond academic evaluation. For organizations deploying AI assistants, customer service bots, or coding copilots, understanding task difficulty in human terms is critical for setting user expectations, allocating human oversight, and designing fallback procedures. If a model can predict that a particular query will take a human 20 minutes to verify, that information can trigger escalation protocols or additional quality checks.
Implications for AI Practitioners
For AI engineers and product managers, BRIDGE offers a practical tool for cost-benefit analysis. Instead of guessing which tasks benefit most from human-in-the-loop systems, teams can use model-derived difficulty estimates to route tasks intelligently. Simple, low-difficulty items can be fully automated; complex, high-difficulty items can be flagged for human review.
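The routing idea can be sketched as a simple threshold rule. This is an illustration only: the `Task` type, the `route` function, and the 15-minute cutoff are hypothetical, and the predicted times are assumed to come from some BRIDGE-style estimator.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    predicted_minutes: float  # assumed output of a difficulty estimator

def route(task, review_threshold_minutes=15.0):
    """Send tasks at or above the threshold to a human, automate the rest."""
    if task.predicted_minutes >= review_threshold_minutes:
        return "human_review"
    return "automate"

if __name__ == "__main__":
    queue = [Task("reset-password", 2.0), Task("refactor-module", 45.0)]
    for task in queue:
        print(task.task_id, "->", route(task))
```

In practice the threshold would be tuned against review capacity and the cost of errors rather than fixed in code.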
The approach also opens the door to continuous monitoring. As models are updated or deployed in new domains, BRIDGE can provide ongoing estimates of whether tasks are becoming easier or harder for end users. This is particularly valuable in safety-critical applications where task time correlates with error rates, since longer tasks often mean more fatigue and more mistakes.
However, practitioners should note a key limitation: BRIDGE relies on the assumption that model difficulty and human difficulty are correlated. This may break down for tasks where humans and AI use fundamentally different reasoning strategies. A task that is trivial for a human (e.g., recognizing a familiar object in a cluttered scene) may be hard for a vision model, and vice versa. Validation in specific deployment contexts will be essential.
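One way to run such a validation, sketched under assumptions, is to time humans on a small sample of tasks and compute the Spearman rank correlation between predicted and measured times. The sample data below is made up, and what counts as a high enough correlation is a judgment call for each deployment.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

if __name__ == "__main__":
    # Hypothetical validation sample: predicted vs. measured human minutes.
    predicted = [3.0, 12.0, 25.0, 40.0, 90.0]
    measured = [4.0, 9.0, 30.0, 28.0, 110.0]
    print(f"Spearman rho = {spearman(predicted, measured):.2f}")
```

Rank correlation only checks that harder-for-the-model tasks are also slower for humans. It does not check that the predicted minutes are calibrated, so absolute error should be examined separately.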
Key Takeaways
- BRIDGE predicts human task completion time using AI model performance metrics, reducing reliance on costly human annotation studies
- This enables more nuanced AI evaluation beyond simple accuracy, grounding benchmark results in human-interpretable difficulty measures
- Practitioners can use these predictions to optimize human-AI task routing, set user expectations, and monitor task difficulty over time
- The approach assumes model-human difficulty correlation, which requires domain-specific validation before deployment in production systems