Research · 2026-06-24

GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

arXiv:2606.24551v1. Abstract: Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions....

This new arXiv preprint tackles an overlooked question in the evaluation of computer-use agents: are we measuring the agent, or are we measuring the interface? The paper compares graphical user interface (GUI) agents against command-line interface (CLI) agents while controlling for task complexity, initial states, and verification methods, a level of rigor that prior benchmarks often lack.

What the Research Reveals

The core finding is that GUI agents suffer from significant “execution bottlenecks” that are largely absent in CLI agents. These bottlenecks are not about reasoning ability; they come from perception and action. A GUI agent must parse screen pixels, locate buttons, and simulate mouse clicks, and each of those steps introduces noise and delay. A CLI agent, by contrast, issues a text command and receives a structured response. The paper demonstrates that even when both agents use the same underlying language model, the CLI variant consistently completes tasks faster and with fewer errors, particularly in multi-step workflows.

Crucially, the research isolates the interaction modality as the primary variable. This means that many existing benchmarks claiming to evaluate “agent intelligence” may actually be evaluating the inefficiency of screen parsing and action execution. The authors propose a framework to disentangle these factors, urging the field to adopt standardized baselines that account for modality-specific overhead.
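
To make the idea of isolating modality concrete, here is a minimal Python sketch of a paired comparison. It is not the authors' framework; the names Task, paired_success_rates, toy_cli_agent, and toy_gui_agent are hypothetical, and the toy agents are stand-ins for real ones. The only point it illustrates is that every agent starts from the same initial state and is scored by the same verifier, so the modality is the one thing that varies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]
Agent = Callable[[State], State]  # takes a starting state, returns the final state


@dataclass
class Task:
    name: str
    initial_state: State               # identical starting point for every modality
    verifier: Callable[[State], bool]  # identical success check for every modality


def paired_success_rates(tasks: List[Task], agents: Dict[str, Agent]) -> Dict[str, float]:
    """Score every agent on the same tasks, states, and verifiers."""
    rates = {}
    for modality, agent in agents.items():
        passed = 0
        for task in tasks:
            # Each run gets a fresh copy so one agent cannot affect another.
            final_state = agent(dict(task.initial_state))
            passed += task.verifier(final_state)
        rates[modality] = passed / len(tasks)
    return rates


# Toy stand-ins (hypothetical): the goal is to set "mode" to "dark".
def toy_cli_agent(state: State) -> State:
    state["mode"] = "dark"
    return state


def toy_gui_agent(state: State) -> State:
    # Simulates a mis-click when the layout is compact.
    if state.get("layout") == "compact":
        state["font"] = "large"
    else:
        state["mode"] = "dark"
    return state


def is_dark(state: State) -> bool:
    return state["mode"] == "dark"


tasks = [
    Task("default-layout", {"mode": "light", "layout": "default"}, is_dark),
    Task("compact-layout", {"mode": "light", "layout": "compact"}, is_dark),
]
rates = paired_success_rates(tasks, {"cli": toy_cli_agent, "gui": toy_gui_agent})
print(rates)
```

Because the tasks, starting states, and verifier are shared, any gap between the two rates in this toy setup can only come from how each agent acts on the environment.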

Why This Matters

For AI practitioners, this paper delivers a sobering reality check. The industry has poured immense resources into GUI agents, such as models that control web browsers, desktop apps, or mobile interfaces. The promise is that these agents can interact with any software “like a human.” However, this research suggests that for many backend or data-heavy tasks, a CLI agent is not just faster but more reliable. The bottleneck is not the LLM’s reasoning; it is the pixel-to-action pipeline.

This has direct implications for deployment. If you are building an agent to manage cloud infrastructure, run database queries, or execute DevOps scripts, a CLI-first design is likely superior. Conversely, GUI agents remain essential for legacy enterprise software or consumer apps that lack programmatic access. The paper implies that the optimal architecture is not a single agent type, but a hybrid system that routes tasks to the appropriate interface.
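
The routing idea described above could look something like the following Python sketch. It is an illustration under stated assumptions, not a design taken from the paper: TaskSpec, its cli_tool field, and choose_modality are hypothetical names, and "a programmatic interface is available" is approximated here by checking whether a named command is on the PATH with the standard library's shutil.which.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TaskSpec:
    description: str
    # Hypothetical field: the command-line tool that can do this task, if one exists.
    cli_tool: Optional[str] = None


def choose_modality(
    task: TaskSpec,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return "cli" when the task's tool is installed, otherwise fall back to "gui"."""
    if task.cli_tool and which(task.cli_tool) is not None:
        return "cli"
    return "gui"


# Usage with an injected lookup so the example does not depend on the local machine.
def fake_which(name: str) -> Optional[str]:
    return "/usr/bin/" + name if name == "psql" else None


query_task = TaskSpec("run a database query", cli_tool="psql")
legacy_task = TaskSpec("export a report from a legacy desktop app")

print(choose_modality(query_task, which=fake_which))   # cli
print(choose_modality(legacy_task, which=fake_which))  # gui
```

Passing the lookup function as a parameter keeps the router testable; a production router would likely also consider permissions, API availability, and whether the task is inherently visual.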

Implications for AI Practitioners

Benchmark Hygiene: When evaluating agents, practitioners must control for interaction modality. A 90% score for a GUI agent on a web navigation task does not show that it is more capable than a CLI agent scoring 80% on a comparable command-line task; execution time and error rate have to be factored in before the two are compared.
Latency Budgeting: The paper highlights that GUI actions (screenshot, parse, click, wait) can add seconds per step. For agents that require dozens of steps, this accumulates into minutes of overhead. Practitioners should model this latency when designing real-time systems.
Fallback Logic: The strongest systems will likely use CLI for structured, deterministic operations and GUI for unstructured, visual-only tasks. Building a router that detects interface availability could dramatically improve overall agent efficiency.
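
The latency budgeting point above comes down to simple arithmetic, sketched here in Python. All numbers are illustrative assumptions chosen for the example, not measurements from the paper, and episode_latency_s is a hypothetical helper that assumes every step costs one model call plus one interface round trip.

```python
def episode_latency_s(steps: int, llm_s: float, interface_s: float) -> float:
    """Total wall-clock time if every step pays one model call plus one interface action."""
    return steps * (llm_s + interface_s)


# Assumed values for illustration only.
STEPS = 40     # "dozens of steps"
LLM_S = 1.5    # model inference time per step
GUI_S = 3.0    # screenshot + parse + click + wait, per step
CLI_S = 0.25   # issue a command and read its output, per step

gui_total = episode_latency_s(STEPS, LLM_S, GUI_S)
cli_total = episode_latency_s(STEPS, LLM_S, CLI_S)
overhead = gui_total - cli_total

print(f"GUI episode: {gui_total:.0f} s")         # 180 s
print(f"CLI episode: {cli_total:.0f} s")         # 70 s
print(f"Modality overhead: {overhead:.0f} s")    # 110 s
```

Under these assumed numbers the interface alone adds nearly two minutes to a 40-step episode, which is the kind of figure worth estimating before committing to a real-time design.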

Key Takeaways

GUI agents suffer from measurable execution bottlenecks (lower speed and higher error rates) compared to CLI agents on equivalent tasks, even when using the same underlying LLM.
Existing benchmarks may conflate interface efficiency with agent intelligence, leading to misleading performance comparisons.
For production systems, CLI-first architectures are preferable for speed and reliability, while GUI agents should be reserved for tasks that lack programmatic access.
Practitioners should design hybrid agents that dynamically select the optimal interaction modality based on the task and environment.

Read the original article on arXiv (cs.AI)

Tags: arxiv, papers, agents