Research 2026-06-19

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

arXiv:2606.20363v1. Abstract: Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline...

What Happened

A new arXiv preprint (2606.20363v1) proposes a three-stage pipeline for automatically generating SKILL.md files, which are structured documentation of agent capabilities, from interaction trajectories. The core idea is that computer-using agents (CUAs) leave behind traces of their actions, observations, and decision points as they perform tasks. Mining these trajectories yields reusable skill definitions that can be compiled into a machine-readable skill library. The abstract excerpt does not name the three stages; they plausibly cover trajectory segmentation, skill abstraction, and validation against downstream task performance.
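To make the mining idea concrete, here is a minimal sketch in Python. It is not the paper's method: it assumes each trajectory is a list of action labels, treats frequently repeated contiguous action sequences as candidate skills, and renders one candidate as a small Markdown document. The function names, the action labels, and the SKILL.md layout are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Trajectory = Sequence[str]  # assumption: one action label per step


def mine_candidate_skills(
    trajectories: List[Trajectory], length: int = 3, min_support: int = 2
) -> List[Tuple[Tuple[str, ...], int]]:
    """Keep contiguous action n-grams that occur in at least min_support trajectories."""
    support: Counter = Counter()
    for traj in trajectories:
        # A set ensures each trajectory counts at most once per n-gram.
        seen = {tuple(traj[i : i + length]) for i in range(len(traj) - length + 1)}
        support.update(seen)
    kept = [(ngram, n) for ngram, n in support.items() if n >= min_support]
    # Highest support first, then lexicographic order for a stable result.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


def render_skill_md(name: str, steps: Sequence[str], support: int) -> str:
    """Render one mined skill as Markdown (this layout is an assumption)."""
    lines = [f"# {name}", "", f"Observed in {support} trajectories.", "", "## Steps", ""]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines) + "\n"


# Illustrative, made-up trajectories.
trajectories = [
    ["open_app", "click_login", "type_user", "type_pass", "submit", "open_report"],
    ["click_login", "type_user", "type_pass", "submit", "export_csv"],
    ["open_app", "search", "open_report"],
]
skills = mine_candidate_skills(trajectories, length=4, min_support=2)
skill_doc = render_skill_md("login", skills[0][0], skills[0][1])
```

Counting support per trajectory rather than per occurrence keeps a single long, repetitive session from dominating the candidate list. A real pipeline would also need to handle observations, parameters, and variable-length skills, which this sketch ignores.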

This is a direct response to a persistent problem: while explicit skill libraries make agent behavior more interpretable and debuggable, manually curating them is labor-intensive and brittle. The researchers ask whether such libraries can be discovered rather than designed, and whether doing so actually improves policy learning.

Why It Matters

The significance lies in bridging two competing paradigms in agent development. On one hand, end-to-end learned policies (e.g., large language models trained on demonstrations) are powerful but opaque—you cannot easily inspect why an agent clicked a specific button. On the other hand, explicit skill libraries offer transparency but require expert curation that doesn't scale.

If this approach works, it offers a middle path: skill libraries that emerge organically from interaction data, preserving inspectability without sacrificing automation. For CUAs—which must navigate complex, real-world interfaces—this could accelerate deployment in regulated environments where auditability is mandatory (finance, healthcare, enterprise software).

The paper's framing is also notable for its honesty: it explicitly asks whether mined skills improve downstream policies, not just whether they can be generated. This shifts the evaluation from "can we extract patterns?" to "do these patterns actually help agents perform better?"—a more rigorous standard than much prior work on skill discovery.
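The downstream-improvement criterion can be illustrated with a small paired comparison. The sketch below assumes per-task success flags for the same agent run with and without a mined skill library; the function name and the task outcomes are invented for illustration and are not results from the paper.

```python
from typing import Dict


def paired_uplift(baseline: Dict[str, bool], with_skills: Dict[str, bool]) -> float:
    """Difference in success rate on tasks present in both runs."""
    shared = sorted(baseline.keys() & with_skills.keys())
    if not shared:
        raise ValueError("no tasks in common between the two runs")
    base_rate = sum(baseline[t] for t in shared) / len(shared)
    skill_rate = sum(with_skills[t] for t in shared) / len(shared)
    return skill_rate - base_rate


# Made-up outcomes: task name -> whether the agent succeeded.
baseline_runs = {"login": True, "export": False, "search": True, "invoice": False}
skill_runs = {"login": True, "export": True, "search": True, "invoice": False}
uplift = paired_uplift(baseline_runs, skill_runs)
```

Restricting the comparison to shared tasks avoids crediting the skill library for differences that come only from a changed task mix. A positive `uplift` on a handful of tasks is not evidence by itself; a real evaluation would need enough tasks and repeated runs to separate the effect from noise.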

Implications for AI Practitioners

For agent builders: If validated, this pipeline could reduce the engineering burden of maintaining skill libraries. Instead of manually defining "login to CRM" or "extract invoice data," you let the agent's own interactions reveal what skills are actually useful in your specific environment. This is particularly valuable for domain-specific CUAs where off-the-shelf skill taxonomies don't apply.

For MLOps teams: The three-stage pipeline introduces a new artifact into the agent development lifecycle: the auto-generated skill library. Teams will need tooling to review, version, and potentially override mined skills (since automated extraction may produce false positives or unsafe patterns). This mirrors the evolution of feature stores in traditional ML, which moved from manual feature engineering to automated feature discovery.

For researchers: The paper implicitly raises a question that deserves more attention: how do we measure the quality of a mined skill? Downstream policy improvement is one metric, but skill libraries also serve interpretability, transfer learning, and safety verification. A skill that boosts performance but is incomprehensible to humans may defeat the purpose of explicit libraries.
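The review, versioning, and override tooling mentioned for MLOps teams could start from something as small as the following sketch. It assumes a mined skill is a name plus its SKILL.md text, derives a version from a content hash, and gates deployment on an explicit human decision. The class, the status values, and the helper names are hypothetical, not part of any existing tool.

```python
import hashlib
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class MinedSkill:
    name: str
    body: str                # the generated SKILL.md text
    status: str = "pending"  # assumed states: pending, approved, rejected

    @property
    def version(self) -> str:
        """Content-derived version: changes whenever the body changes."""
        return hashlib.sha256(self.body.encode("utf-8")).hexdigest()[:12]


def review(skill: MinedSkill, approve: bool) -> MinedSkill:
    """Record a human decision without mutating the original record."""
    return replace(skill, status="approved" if approve else "rejected")


def deployable(skills: List[MinedSkill]) -> List[MinedSkill]:
    """Only explicitly approved skills reach the agent."""
    return [s for s in skills if s.status == "approved"]


mined = [
    MinedSkill("login", "# login\n1. click_login\n2. submit\n"),
    MinedSkill("delete_all_records", "# delete_all_records\n1. select_all\n2. delete\n"),
]
reviewed = [review(mined[0], approve=True), review(mined[1], approve=False)]
approved_names = [s.name for s in deployable(reviewed)]
```

Defaulting every mined skill to `pending` means an unsafe pattern cannot reach the agent unless a reviewer approves it, and hashing the body gives each revision a stable identifier that can be referenced in audit logs.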

Key Takeaways

A new pipeline proposes automated generation of SKILL.md files from agent interaction trajectories, aiming to combine the benefits of explicit skill libraries with the scalability of data-driven discovery.
The approach directly addresses the tension between agent transparency and automation, with potential applications in regulated industries where auditability is critical.
Practitioners should watch for validation results on whether mined skills actually improve downstream policies—the paper's stated evaluation criterion—rather than just producing plausible-looking libraries.
If successful, this could shift agent development workflows toward continuous skill mining, requiring new tooling for review, versioning, and safety checks on auto-generated capabilities.

Read Original Article on arXiv cs.AI
