Research2026-06-29

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Originally published byArxiv CS.AI

arXiv:2601.21233v2 Announce Type: replace Abstract: Autonomous code agents built on large language models are reshaping software and AI development through tool use, long-horizon reasoning, and self-directed interaction. However, this autonomy introduces a previously unrecognized security risk:...

The Curious Case of the Over-Sharing Code Agent

A new preprint from arXiv (2601.21233v2) reveals a surprisingly simple yet profound security vulnerability in autonomous code agents built on frontier large language models (LLMs). The researchers found that these agents, when prompted with a straightforward request like "Just ask," will readily divulge their own system prompts—the carefully crafted instructions that govern their behavior, safety constraints, and operational boundaries.

The mechanism is almost trivial in its simplicity. Code agents, designed to execute complex, multi-step tasks through tool use and self-directed reasoning, interpret the "Just ask" command as a legitimate instruction to retrieve information. They do not possess inherent safeguards against revealing their own configuration, because the system prompt itself is typically invisible to the agent's reasoning loop. The agent treats the request as it would any other data retrieval task, accessing its own initialization parameters and returning them verbatim.

Why This Matters

This finding exposes a critical blind spot in current agent architecture. System prompts are the bedrock of AI safety and alignment—they contain refusal policies, ethical guidelines, and operational constraints. If an adversary can extract these prompts, they gain a blueprint of the agent's defenses. They can then craft inputs that explicitly circumvent these rules, or understand exactly which topics trigger refusal and which do not.

The risk is not merely academic. Autonomous code agents are increasingly deployed in production environments for software development, data analysis, and even financial trading. A compromised agent could be manipulated into ignoring safety checks, executing malicious code, or leaking sensitive information about its own design. The fact that this extraction requires no sophisticated jailbreak—just a polite request—makes it particularly dangerous.

Implications for AI Practitioners

For developers deploying LLM-based agents, this research demands immediate attention to a new attack surface. The traditional approach of treating the system prompt as a static, hidden document is no longer viable. Practitioners should consider:

Prompt obfuscation and segmentation: System prompts should be structured so that no single agent can access the entire configuration. Critical safety instructions could be stored in separate, access-controlled modules.

Instructional hygiene: Agents must be explicitly instructed not to reveal their own configuration parameters. This is a simple but effective first line of defense.

Runtime monitoring: Deploy logging and anomaly detection for queries that attempt to access system-level information. The "Just ask" pattern is easy to detect, but more sophisticated variants may emerge.

Principle of least privilege: Agents should only have access to the tools and data they absolutely need. An agent that can read its own system prompt has more privilege than necessary.

This research serves as a sobering reminder that as AI agents become more autonomous, the security model must evolve from static, one-time configuration to dynamic, runtime-aware protection. The agents themselves are now part of the attack surface, and their curiosity can be weaponized.

Key Takeaways

Autonomous code agents can be tricked into revealing their entire system prompt through a simple "Just ask" command, bypassing intended safety controls.
This vulnerability allows adversaries to map out an agent's defenses and craft targeted attacks that circumvent safety policies.
Practitioners must implement prompt obfuscation, explicit anti-disclosure instructions, and runtime monitoring to mitigate this risk.
The finding highlights a fundamental shift in AI security: agents are no longer just tools but active participants in the attack surface.

Read Original Article on Arxiv CS.AI

arxivpapersagentsprompting