Research | 2026-06-24

When Preferences Fail to Become Incentives: A Utility-Behavior Gap in Large Language Models

arXiv:2606.22974v2. Abstract: Recent work on preference elicitation in large language models (LLMs) has demonstrated that, when given a series of choices between two outcomes, LLMs reveal a coherent, model-specific utility structure. Notably, this structure often includes...

The Utility-Behavior Gap: When LLMs Know What They Want But Won't Act On It

A new arXiv preprint (2606.22974v2) reports a gap in large language models: LLMs reveal coherent, model-specific utility structures when asked to choose between outcomes, yet these preferences fail to act as behavioral incentives during generation. According to the researchers, the utility structure elicited through pairwise comparisons does not predict or drive the model's token-level decisions when it produces text.

This finding points to a disconnect between what LLMs say they prefer and what they do during autoregressive generation. The utility functions derived from preference elicitation tasks appear to be latent representations that the model can access for explicit reasoning but that are not integrated into the core generation process. In other words, an LLM can tell you it prefers helpful, harmless responses, but that knowledge does not necessarily guide its next-token predictions.
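The pairwise elicitation step described above can be illustrated with a small sketch. It assumes a Bradley-Terry choice model, which is a common way to turn pairwise choices into scalar utilities but is not confirmed here to be the paper's method. The function name, the outcome indices, and the choice counts are all hypothetical.

```python
import math

def fit_bradley_terry(wins, n_items, iters=200):
    """Fit log-utilities from pairwise choice counts (Bradley-Terry, MM updates).

    wins[(i, j)] is how often outcome i was chosen over outcome j.
    Assumes every outcome is chosen at least once, so no strength reaches zero.
    """
    strength = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            # Total number of times outcome i was preferred.
            wins_i = sum(c for (winner, loser), c in wins.items() if winner == i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += n_ij / (strength[i] + strength[j])
            updated.append(wins_i / denom if denom > 0 else strength[i])
        # Rescale so the strengths keep a fixed sum; only ratios matter.
        scale = n_items / sum(updated)
        strength = [s * scale for s in updated]
    return [math.log(s) for s in strength]

# Made-up choice counts for three hypothetical outcomes (0, 1, 2).
example_wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
utilities = fit_bradley_terry(example_wins, n_items=3)
print([round(u, 2) for u in utilities])
```

The fitted values are only defined up to an additive constant, so what matters is their ordering and spacing. These are the "stated" utilities that a behavioral test would then compare against what the model generates.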

Why This Matters

The utility-behavior gap has direct implications for alignment research. Current techniques such as reinforcement learning from human feedback (RLHF) and constitutional AI assume that training on preference data shapes the model's generation policy. This paper suggests that preference learning may instead shape representations that are separate from the ones driving token selection, meaning fine-tuning on preferences might not fully bridge the gap.

For safety-critical applications, this is concerning. If an LLM can pass an alignment evaluation by correctly stating its preferences, yet still generate harmful content because those preferences do not act as behavioral incentives, the evaluation gives a false sense of security. Such a model knows the "right" answers without treating them as constraints on its output.

Implications for AI Practitioners

First, evaluation methods must test behavior, not just stated preferences. Asking an LLM to rank outcomes is insufficient; practitioners need to measure whether those rankings correlate with actual generation choices under varied conditions.
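One simple way to check whether rankings line up with generation choices is a pairwise agreement score: for every pair of outcomes, does the one with higher stated utility also appear more often in free generation? The sketch below assumes both quantities have already been measured; the outcome labels and all numbers are made up for illustration, and the function name is hypothetical.

```python
from itertools import combinations

def pairwise_agreement(stated_utility, behavior_rate):
    """Fraction of outcome pairs ordered the same way by both measures.

    stated_utility: outcome -> utility elicited from explicit choices.
    behavior_rate:  outcome -> how often the model produced that outcome
                    when generating freely.
    Pairs tied on either measure are skipped.
    """
    agree = 0
    total = 0
    for a, b in combinations(stated_utility, 2):
        du = stated_utility[a] - stated_utility[b]
        db = behavior_rate[a] - behavior_rate[b]
        if du == 0 or db == 0:
            continue
        total += 1
        if (du > 0) == (db > 0):
            agree += 1
    return agree / total if total else float("nan")

# Illustrative numbers only, not results from the paper.
stated = {"A": 1.2, "B": 0.4, "C": -0.3, "D": -1.3}
observed = {"A": 0.30, "B": 0.40, "C": 0.10, "D": 0.20}
score = pairwise_agreement(stated, observed)
print(f"pairwise agreement: {score:.2f}")
```

A score near 1.0 means behavior follows the stated ranking, while a score near 0.5 means the ranking carries little information about behavior. Repeating the measurement across prompts, temperatures, and framings addresses the "varied conditions" part of the recommendation.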

Second, alignment techniques may need to target the generation layer directly. Approaches like activation steering, contrastive decoding, or modifying the logit distribution during inference could prove more effective than preference fine-tuning alone, since they operate at the token-prediction level where the gap manifests.
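As a toy illustration of intervening at the logit level, the sketch below adds a per-token bonus to the raw logits before the softmax. This is the simplest form of logit modification, not an implementation of activation steering or contrastive decoding, and it is not taken from the paper. The `token_utility` mapping is a hypothetical stand-in: turning an outcome-level utility into per-token scores is itself an open problem.

```python
import math

def softmax(logits):
    """Convert a token -> logit mapping into a probability distribution."""
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    norm = sum(exps.values())
    return {tok: e / norm for tok, e in exps.items()}

def steer_logits(logits, token_utility, strength=1.0):
    """Add strength * utility to each token's logit (hypothetical utility map)."""
    return {tok: v + strength * token_utility.get(tok, 0.0)
            for tok, v in logits.items()}

# Made-up next-token logits and a made-up utility bonus for one token.
base_logits = {"yes": 2.0, "no": 1.0, "maybe": 0.0}
token_utility = {"no": 1.5}

base_probs = softmax(base_logits)
steered_probs = softmax(steer_logits(base_logits, token_utility, strength=1.0))

print(max(base_probs, key=base_probs.get))
print(max(steered_probs, key=steered_probs.get))
```

With `strength=0.0` the distribution is unchanged, so the parameter gives a direct knob for how strongly the elicited preference is allowed to override the model's default next-token choice.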

Third, this finding complicates the use of LLMs as utility maximizers in agentic systems. If a model's revealed preferences do not translate into incentives, then using it as a decision-maker in multi-step tasks may produce inconsistent or suboptimal behavior: the model might "know" the best path but fail to follow it.
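For agentic settings, one rough way to quantify "knowing the best path but not following it" is per-step regret: the utility of the best available action minus the utility of the action the agent took, both scored with the model's own elicited utilities. The sketch below is an assumption-laden illustration; the action names, utilities, and trajectory are invented.

```python
def step_regrets(steps, utility):
    """Per-step gap between the best available action and the chosen one.

    steps:   list of (available_actions, chosen_action) pairs.
    utility: action -> utility elicited from the same model.
    """
    regrets = []
    for options, chosen in steps:
        best = max(utility[o] for o in options)
        regrets.append(best - utility[chosen])
    return regrets

# Hypothetical elicited utilities and a hypothetical three-step trajectory.
utility = {"verify": 1.0, "guess": 0.2, "skip": -0.5}
trajectory = [
    (["verify", "guess"], "verify"),
    (["verify", "skip"], "skip"),
    (["guess", "skip"], "guess"),
]

regrets = step_regrets(trajectory, utility)
optimal_fraction = sum(r == 0 for r in regrets) / len(regrets)
print(regrets, round(optimal_fraction, 2))
```

Nonzero regret at a step means the agent acted against the preference it reveals when asked directly, which is the kind of inconsistency the gap would predict.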

The paper ultimately suggests that preference elicitation and behavioral alignment are distinct challenges. Until the utility-behavior gap is closed, we cannot assume that a model's explicit preferences reflect its operational incentives.

Key Takeaways

LLMs can reveal coherent utility structures through preference elicitation, but these do not necessarily drive token-level generation behavior.
The utility-behavior gap means alignment evaluations based on stated preferences may overestimate actual safety and reliability.
Practitioners should prioritize behavioral testing over preference surveys, and explore inference-time interventions that directly shape generation.
Agentic systems relying on LLM utility maximization may face unpredictable failures unless this gap is explicitly addressed.

