Research 2026-07-03

The Token Not Taken: Sampling, State, and the Stochasticity of AI Agents

Originally published by arXiv CS.AI

arXiv:2606.08998v3. Abstract: Agentic AI systems can behave differently across runs: the same request may produce a different plan, a different tool call, a different code edit, or a different final answer. Such variability arises from several layers that are often conflated....

The Unseen Instability of Agentic Systems

The arXiv paper "The Token Not Taken" tackles a fundamental yet often overlooked problem in agentic AI: non-determinism. Large language models have always been stochastic text generators, but the paper argues that this variability is structurally amplified when models are embedded in agentic loops, where a single token choice can cascade into entirely different tool calls, code edits, or plans. The authors systematically disentangle the layers of randomness involved: sampling temperature, seed settings, system state, and the inherent unpredictability of external tool outputs.

Why This Matters

For practitioners, this is not an academic curiosity. The core insight is that agentic systems are not merely "unreliable" in the way a stochastic parrot might be; they are structurally unstable. A temperature difference as small as 0.1 can cause an agent to call a read API instead of a write API, or to pursue an entirely different subgoal. The paper’s key contribution is showing that these divergences are not implementation bugs but consequences of the architecture: the same request can produce different outcomes because the agent’s state (the sequence of tokens and tool outputs so far) is path-dependent, so every step conditions on whatever happened to be chosen before it.
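To make the temperature sensitivity concrete, here is a minimal sketch of temperature-scaled softmax over two candidate tool calls. The logits and the tool names `read_record` and `write_record` are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for the given logits at a temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 corresponds to greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two candidate tool calls (assumed values).
TOOL_LOGITS = {"read_record": 2.0, "write_record": 1.5}

def tool_probabilities(temperature):
    """Map each hypothetical tool name to its sampling probability."""
    probs = softmax_with_temperature(list(TOOL_LOGITS.values()), temperature)
    return dict(zip(TOOL_LOGITS.keys(), probs))

if __name__ == "__main__":
    for t in (0.2, 0.3, 1.0):
        print(t, tool_probabilities(t))
```

With these assumed logits, moving the temperature from 0.2 to 0.3 roughly doubles the probability of the less likely tool call. In an agentic loop, whichever call is actually sampled then becomes part of the state that every later step conditions on.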

This has immediate practical consequences. In production environments, developers often assume that setting a fixed seed and temperature=0 will guarantee reproducibility. The paper demonstrates that this assumption breaks down when agents interact with external systems (databases, APIs, filesystems) that have their own non-determinism. Moreover, the agent’s own internal reasoning chain can bifurcate when floating-point rounding in the logits flips a near-tie between candidate tokens, creating what the authors call "stochastic forks."
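The floating-point point can be illustrated with a small sketch. It assumes that the same mathematical sum may be accumulated in a different order from run to run (as can happen with parallel reductions), and it uses deliberately exaggerated magnitudes so the effect is visible; real logit differences are far smaller. The function names are hypothetical, and this is not the paper's experiment.

```python
def accumulate(terms):
    """Sum floats strictly left to right, with no compensation."""
    total = 0.0
    for term in terms:
        total += term
    return total

def greedy_choice(logits):
    """Temperature-0 decoding: pick the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# The same three contributions, accumulated in two different orders.
order_a = [1e17, 1.0, -1e17]   # the 1.0 is absorbed by 1e17 and lost
order_b = [1e17, -1e17, 1.0]   # the large terms cancel first, so 1.0 survives

competitor_logit = 0.5
logits_a = [competitor_logit, accumulate(order_a)]
logits_b = [competitor_logit, accumulate(order_b)]

if __name__ == "__main__":
    print(accumulate(order_a), accumulate(order_b))
    print(greedy_choice(logits_a), greedy_choice(logits_b))
```

Both runs use temperature 0 and involve no random number generator, yet the greedy choice differs because floating-point addition is not associative. Once the chosen token differs, every later step sees a different context.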

Implications for AI Practitioners

First, testing must account for distributional behavior, not single outputs. Current evaluation practices—running one or two examples and checking for correctness—are insufficient. Practitioners need to run the same prompt multiple times with different seeds and temperatures to characterize the range of possible behaviors.
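One way to approach distributional testing is sketched below. The `toy_agent` function is a hypothetical stand-in for a real agent run (it only simulates a choice between two tools), and the probability formula inside it is an arbitrary assumption; the part meant to carry over is the harness that runs many seeds and reports outcome frequencies.

```python
import random
from collections import Counter

def toy_agent(prompt, seed, temperature):
    """Hypothetical stand-in for one agent run; returns the tool it chose."""
    rng = random.Random(seed)
    # Assumed toy model: a higher temperature makes the alternative tool more likely.
    p_alternative = min(0.5, 0.1 + 0.4 * temperature)
    return "write_record" if rng.random() < p_alternative else "read_record"

def outcome_distribution(agent, prompt, seeds, temperature):
    """Run the agent once per seed and return the share of each outcome."""
    counts = Counter(agent(prompt, seed, temperature) for seed in seeds)
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

if __name__ == "__main__":
    for temperature in (0.0, 0.2, 0.7):
        dist = outcome_distribution(toy_agent, "update the record", range(200), temperature)
        print(temperature, dist)
```

A test suite built this way would assert on the distribution (for example, that an unsafe outcome stays below some threshold) rather than on a single output.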

Second, state management becomes a first-class concern. The paper implies that agentic systems should log not just the final answer but the full trajectory of tokens, tool calls, and internal states. This is necessary for debugging and for implementing rollback mechanisms when an agent diverges from expected behavior.
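A minimal sketch of such trajectory logging follows. The `Step` and `Trajectory` classes, their field names, and the three step kinds are assumptions made for illustration; the paper is not quoted here as prescribing any particular schema.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Step:
    index: int
    kind: str      # assumed kinds: "model_output", "tool_call", "tool_result"
    content: str

@dataclass
class Trajectory:
    run_id: str
    seed: int
    temperature: float
    steps: list = field(default_factory=list)

    def record(self, kind, content):
        """Append one step, numbering steps in the order they occur."""
        self.steps.append(Step(len(self.steps), kind, content))

    def to_jsonl(self):
        """Serialize as JSON Lines: one header line, then one line per step."""
        header = {"run_id": self.run_id, "seed": self.seed, "temperature": self.temperature}
        lines = [json.dumps(header)] + [json.dumps(asdict(step)) for step in self.steps]
        return "\n".join(lines)

def first_divergence(a, b):
    """Return the index of the first differing step, or None if the runs match."""
    for step_a, step_b in zip(a.steps, b.steps):
        if (step_a.kind, step_a.content) != (step_b.kind, step_b.content):
            return step_a.index
    if len(a.steps) != len(b.steps):
        return min(len(a.steps), len(b.steps))
    return None
```

Comparing two logged runs with `first_divergence` points at the step where they forked, which is also a natural candidate for a rollback point.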

Third, temperature tuning takes on new importance. The common practice of setting temperature to 0 for "deterministic" agents is shown to be a false comfort. The paper suggests that practitioners should instead embrace controlled stochasticity—for example, using low but non-zero temperatures with explicit retry logic, rather than pretending determinism is achievable.
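The idea of controlled stochasticity with explicit retries could look roughly like the sketch below. The helper `run_with_retries`, the `toy_step` agent stub, and the acceptance check are all hypothetical; a real system would substitute its own agent call and validator.

```python
import random

def run_with_retries(agent_step, is_acceptable, max_attempts=3, base_seed=0):
    """Try the step with successive seeds until the validator accepts a result.

    Returns (result, attempts); result is None if every attempt was rejected.
    """
    attempts = []
    for attempt in range(max_attempts):
        seed = base_seed + attempt
        result = agent_step(seed)
        attempts.append((seed, result))  # keep every attempt for later inspection
        if is_acceptable(result):
            return result, attempts
    return None, attempts

def toy_step(seed, temperature=0.2):
    """Hypothetical agent step: occasionally picks the unwanted tool."""
    rng = random.Random(seed)
    return "write_record" if rng.random() < temperature else "read_record"

if __name__ == "__main__":
    result, attempts = run_with_retries(toy_step, lambda r: r == "read_record")
    print(result, attempts)
```

The design choice here is to treat each attempt as a sample and to validate it explicitly, rather than relying on a single run being reproducible. Recording the seed of every attempt keeps the retries themselves auditable.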

Key Takeaways

Agentic AI systems exhibit structural non-determinism that goes far beyond simple text generation randomness, creating cascading divergences in tool calls and plans.
Fixed seeds and zero temperature do not guarantee reproducibility when agents interact with external systems or have complex internal state.
Practitioners must shift from single-output testing to distributional testing, running multiple trajectories to characterize system behavior.
State logging and controlled stochasticity (with retry mechanisms) are more effective than attempting to enforce false determinism in agentic architectures.

