Research · 2026-06-19

The Autonomy Tax: Defense Training Breaks LLM Agents

arXiv:2603.19423v2. Abstract: Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against...

The Hidden Cost of Safety: How Defense Training Breaks LLM Agent Functionality

A new arXiv preprint (arXiv:2603.19423v2) reports a troubling paradox in the development of autonomous LLM agents: the very safety measures designed to protect these systems may be undermining their core functionality. The research, which examines defense-trained models used for tool-based tasks like file operations, API calls, and database transactions, identifies what the authors term an "autonomy tax": a measurable degradation in agent performance when safety guardrails are applied.

The core finding is straightforward but consequential. When LLMs are fine-tuned to reject harmful instructions or avoid risky behaviors (defense training), they become overly cautious. They begin rejecting legitimate, benign tool calls that resemble potentially dangerous ones. A model trained to never delete files might refuse to clean up temporary directories. An agent taught to avoid unauthorized API calls might block a user's own database query. The defense mechanisms, designed for narrow safety scenarios, generalize too broadly, crippling the agent's ability to execute even routine multi-step workflows.

This is not merely a theoretical inconvenience. For practitioners deploying LLM agents in production—think automated code review pipelines, customer support systems that update records, or research assistants that manage data—the autonomy tax translates directly into failed tasks, increased error handling, and frustrated users. The research suggests that the trade-off between safety and capability is steeper than previously acknowledged, particularly for agents that must interact with dynamic, permission-based environments.

Why does this matter now? Because the industry is racing to deploy autonomous agents in high-stakes domains—healthcare records management, financial trading, infrastructure monitoring—where both safety and reliability are non-negotiable. If defense training breaks agent functionality, organizations face an unpalatable choice: deploy unsafe models or deploy models that cannot do their jobs.

The implications for AI practitioners are immediate. First, defense training cannot be treated as a one-size-fits-all solution. It requires careful calibration, perhaps using adversarial testing that includes benign tool calls to prevent overgeneralization. Second, agent architectures may need to separate the safety layer from the reasoning layer, allowing a policy engine to evaluate tool calls without corrupting the model's core capabilities. Third, the research underscores the need for better benchmarks—current safety evaluations often measure refusal rates on harmful prompts but ignore false positives on legitimate tasks.
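The second point, keeping the safety check outside the model, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the ToolCall type, the tool names delete_file and read_file, the SCRATCH_DIRS allowlist, and the default-deny rule. The idea is only that an explicit policy function decides which tool calls run, so the model itself does not need to be trained to refuse them.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class ToolCall:
    """A tool invocation proposed by the agent (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


# Hypothetical allowlist: directories whose contents the agent may delete.
SCRATCH_DIRS = (PurePosixPath("/tmp"), PurePosixPath("/var/tmp"))


def is_strictly_inside(path: PurePosixPath, root: PurePosixPath) -> bool:
    """True if `path` lies below `root` (the root itself does not count)."""
    return root in path.parents


def evaluate(call: ToolCall) -> str:
    """Return "allow" or "deny" for a proposed tool call.

    The policy lives outside the model, so tightening or loosening it
    does not require retraining.
    """
    if call.name == "read_file":
        return "allow"
    if call.name == "delete_file":
        target = PurePosixPath(call.args.get("path", ""))
        # Reject path traversal instead of trying to normalize it.
        if ".." in target.parts:
            return "deny"
        if any(is_strictly_inside(target, root) for root in SCRATCH_DIRS):
            return "allow"
        return "deny"
    # Unknown tools are denied by default (an assumed design choice).
    return "deny"
```

Under this sketch, cleaning up /tmp/build/cache.o is allowed while deleting /etc/passwd is denied, which is the distinction an over-generalized refusal policy fails to make. A real deployment would need a far richer policy (user permissions, resource ownership, audit logging); the sketch only shows where the decision sits.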

Key Takeaways

Defense training introduces an "autonomy tax" where safety fine-tuning causes LLM agents to reject legitimate tool calls, degrading real-world task completion rates.
The trade-off between safety and capability is steeper than expected, forcing practitioners to choose between secure but broken agents and functional but vulnerable ones.
Separating safety evaluation from core reasoning may be necessary to preserve agent functionality while maintaining guardrails.
Current safety benchmarks are insufficient—they fail to measure false positive rates on benign tasks, leaving a critical gap in deployment readiness.
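
The benchmark gap in the last takeaway comes down to reporting two numbers instead of one. The sketch below is a hedged illustration, not the paper's evaluation protocol: the Trial record, the refusal_rates function, and the sample data are all hypothetical. It computes the refusal rate on harmful requests (which should be high) alongside the refusal rate on benign requests (the false positive rate, which should be low).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One evaluated request (hypothetical record format)."""
    harmful: bool  # ground-truth label for the request
    refused: bool  # whether the agent refused to act


def refusal_rates(trials):
    """Return (harmful_refusal_rate, benign_refusal_rate).

    The second value is the false positive rate: the share of
    legitimate requests that the agent wrongly refused.
    """
    harmful = [t for t in trials if t.harmful]
    benign = [t for t in trials if not t.harmful]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign trial")
    harmful_rate = sum(t.refused for t in harmful) / len(harmful)
    benign_rate = sum(t.refused for t in benign) / len(benign)
    return harmful_rate, benign_rate


# Made-up trials for illustration only; these are not results from the paper.
SAMPLE_TRIALS = [
    Trial(harmful=True, refused=True),
    Trial(harmful=True, refused=True),
    Trial(harmful=True, refused=True),
    Trial(harmful=True, refused=True),
    Trial(harmful=False, refused=False),
    Trial(harmful=False, refused=False),
    Trial(harmful=False, refused=False),
    Trial(harmful=False, refused=True),  # a benign task wrongly refused
]

print(refusal_rates(SAMPLE_TRIALS))  # (1.0, 0.25)
```

An evaluation that reported only the first number would score this hypothetical agent as perfect, even though it refuses a quarter of its legitimate tasks.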

Read Original Article on arXiv cs.AI
