Research 2026-07-02

DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching

Originally published by Arxiv CS.AI

arXiv:2606.31980v1 (announce type: cross). Abstract: Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions consisting of 22,752...

What Happened

Researchers have released DigitalCoach, a multimodal dataset capturing 72 human expert-novice coaching sessions for software tasks, comprising 22,752 annotated interactions. The dataset documents how human experts guide novices through computer use, including verbal instructions, screen recordings, and grounding behaviors such as pointing, highlighting, and verifying comprehension. The paper also identifies specific "grounding gaps" where communication breaks down: experts assume shared knowledge that novices lack, or novices misinterpret vague instructions. These are the same points at which current AI agents tend to fail when they try to teach software skills.

Why It Matters

This research addresses a blind spot in the AI industry's current trajectory. While large language models and agentic systems have made impressive strides in performing software tasks autonomously—clicking buttons, filling forms, navigating interfaces—the ability to teach humans to perform those same tasks remains underdeveloped. The distinction is critical: automation replaces human action, while coaching augments human capability.

The grounding gaps documented in DigitalCoach resemble the problems that arise when AI assistants like Claude or GPT attempt to provide software tutorials. Current models often produce overly generic instructions, fail to detect user confusion, and do not adjust explanations in response to real-time feedback. By providing a structured dataset of how human coaches bridge these gaps, DigitalCoach offers a reference for training AI systems to recognize when a user is lost, when an instruction needs simplification, and when to confirm understanding before proceeding.

For enterprise deployments, the stakes are practical. Autonomous task completion can create a new problem for organizations investing in AI copilots and digital assistants: users come to depend on automation without developing the underlying skills, which leaves them stuck when the AI fails or the interface changes. Coaching-capable agents could shift this dynamic by building user competence rather than bypassing it.

Implications for AI Practitioners

First, training data for coaching is fundamentally different from task-completion data. DigitalCoach's value lies not in demonstrating successful task outcomes but in capturing the process of instruction—the back-and-forth clarifications, the non-verbal cues, the moments of confusion and resolution. Practitioners building teaching agents should prioritize collecting similar interaction logs rather than relying solely on static documentation or task demonstrations.
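To make the idea of an interaction log concrete, here is a minimal Python sketch of what one record might look like and how clarification exchanges could be pulled out of it. The `Turn` fields and the `clarification_pairs` helper are hypothetical names chosen for illustration; they are not DigitalCoach's actual schema, which the summary above does not describe.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    """One logged turn in a coaching session (hypothetical schema)."""
    t: float                      # seconds since the session started
    speaker: str                  # "expert" or "novice"
    utterance: str                # transcribed speech
    action: Optional[str] = None  # UI action taken during the turn, if any


def clarification_pairs(turns: List[Turn]) -> List[Tuple[Turn, Turn]]:
    """Pair each novice question with the expert turn that directly follows it."""
    pairs = []
    for prev, nxt in zip(turns, turns[1:]):
        is_question = prev.utterance.strip().endswith("?")
        if prev.speaker == "novice" and is_question and nxt.speaker == "expert":
            pairs.append((prev, nxt))
    return pairs


# Invented example session, for illustration only.
session = [
    Turn(0.0, "expert", "Open the Format menu."),
    Turn(3.2, "novice", "Is that the one at the top?"),
    Turn(5.0, "expert", "Yes, in the bar at the top of the window."),
    Turn(8.1, "novice", "Got it.", action="click:Format"),
]
pairs = clarification_pairs(session)
```

A static tutorial would keep only the first instruction; the log also keeps the question and the repair that followed, which is the part a teaching agent has to learn.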

Second, grounding detection should become a core model capability. The paper's identification of specific grounding gaps suggests that models need explicit mechanisms to detect when shared context has broken down. This could involve training classifiers on the DigitalCoach data to recognize linguistic and visual signals of confusion, or incorporating explicit "check for understanding" loops into agent architectures.
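A "check for understanding" loop of the kind mentioned above could be sketched as follows. This is an assumption-laden illustration, not the paper's method: `CONFUSION_CUES` is a hand-written keyword list standing in for a trained classifier, and `coach_step` and its parameters are hypothetical names.

```python
from typing import Callable, List

# Hypothetical cue list; a classifier trained on coaching data would replace it.
CONFUSION_CUES = ("which one", "where is", "i don't see", "what do you mean", "not sure", "?")


def looks_confused(reply: str) -> bool:
    """Return True if the reply contains any of the confusion cues."""
    text = reply.lower()
    return any(cue in text for cue in CONFUSION_CUES)


def coach_step(instruction: str, simpler: str,
               ask: Callable[[str], str], max_retries: int = 2) -> List[str]:
    """Send an instruction, then resend a simpler version while the reply looks confused.

    `ask` delivers a prompt to the novice and returns their reply.
    Returns every prompt that was sent, in order.
    """
    sent = [instruction]
    reply = ask(instruction)
    retries = 0
    while looks_confused(reply) and retries < max_retries:
        sent.append(simpler)
        reply = ask(simpler)
        retries += 1
    return sent


# Simulated novice: confused at first, then confirms.
replies = iter(["Which one? I see two menus.", "Okay, done."])
sent = coach_step(
    "Open the Format menu.",
    "Click the word 'Format' in the bar at the top of the window.",
    lambda prompt: next(replies),
)
# sent now holds the original instruction followed by one simpler rephrasing
```

The retry cap matters in practice: without it, an agent that misreads a confident reply as confusion would repeat itself indefinitely.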

Third, multimodal alignment is not optional. The dataset's combination of screen recordings, verbal instructions, and action logs underscores that effective coaching requires integrating visual context with natural language. Practitioners should invest in vision-language models that can attend to specific UI elements while generating instructions, rather than treating screen understanding as a separate pipeline.
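One way to picture instructions that are tied to specific UI elements is the sketch below, which builds the instruction text from a detected element's label and on-screen position. The `UIElement` fields, the three-by-three region grid, and the function names are all assumptions made for illustration; in a real system the element would come from a vision-language model or an accessibility tree.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    """A detected on-screen element (hypothetical structure)."""
    label: str  # visible text, e.g. "Format"
    role: str   # e.g. "button" or "menu"
    x: int      # left edge in pixels
    y: int      # top edge in pixels
    w: int      # width in pixels
    h: int      # height in pixels


def screen_region(el: UIElement, screen_w: int, screen_h: int) -> str:
    """Describe where the element's center falls on a 3x3 grid of the screen."""
    cx = el.x + el.w / 2
    cy = el.y + el.h / 2
    horiz = "left" if cx < screen_w / 3 else "right" if cx > 2 * screen_w / 3 else "center"
    vert = "top" if cy < screen_h / 3 else "bottom" if cy > 2 * screen_h / 3 else "middle"
    return f"{vert}-{horiz}"


def grounded_instruction(verb: str, el: UIElement, screen_w: int, screen_h: int) -> str:
    """Build an instruction that names the element and says where to look for it."""
    region = screen_region(el, screen_w, screen_h)
    return f"{verb} the '{el.label}' {el.role} in the {region} of the screen."


menu = UIElement(label="Format", role="menu", x=200, y=0, w=60, h=20)
text = grounded_instruction("Click", menu, 1920, 1080)
print(text)  # prints: Click the 'Format' menu in the top-left of the screen.
```

Because the wording is derived from the same element the agent perceives, the instruction cannot drift from what is actually on screen, which is the point of not treating screen understanding as a separate pipeline.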

Key Takeaways

DigitalCoach provides an annotated, multimodal dataset of 72 human-to-human software coaching sessions and catalogs the communication breakdowns that AI teaching agents must overcome.
The research highlights a critical gap: current AI agents excel at task execution but struggle with the fundamentally different skill of teaching, which requires real-time grounding and adaptive explanation.
For AI practitioners, the dataset offers a training resource for building models that can detect user confusion, adjust instruction granularity, and verify comprehension, capabilities that current agent systems often lack.
Enterprise adoption of AI copilots may be limited without coaching capabilities, as automation without skill transfer creates dependency rather than empowerment.

