Research | 2026-06-30

Predicting Effects, Missing Distributions: Evaluating LLMs as Human Behavior Simulators in Operations Management

Originally published by arXiv cs.AI

arXiv:2510.03310v2. Abstract: Large language models (LLMs) are increasingly used to simulate human behavior in business, economics, and the social sciences, offering a low-cost complement to laboratory experiments, field studies, and surveys. This paper evaluates how...

The Simulation Gap: Why LLMs Struggle to Model Human Behavior in Operations

A new paper on arXiv (2510.03310v2) systematically evaluates how well large language models perform as human behavior simulators in operations management contexts. The research examines whether LLMs can reliably predict human decision-making in supply chain, inventory, and production settings, areas where behavioral biases such as overordering, panic buying, and anchoring are well documented.

The core finding is nuanced: LLMs can predict aggregate effects — such as the general direction of human bias — but they consistently fail to capture the distribution of human responses. In other words, an LLM might correctly guess that humans tend to overorder under uncertainty, but it cannot replicate the variance, skew, or individual heterogeneity that actual experiments reveal.

Why This Matters

This distinction between predicting effects and predicting distributions is critical for operations management. Many decisions, from safety stock levels to capacity planning, depend not just on average behavior but on the tails of the distribution. A simulation that gets the mean right but the variance wrong could lead to systematic under- or over-investment in buffers, creating fragile supply chains.
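To make the buffer argument concrete, here is a minimal sketch using the standard textbook safety stock formula under normally distributed demand (z times sigma times the square root of lead time). The formula is not taken from the paper, and the two sigma values, the lead time, and the service level are hypothetical numbers chosen only for illustration.

```python
from statistics import NormalDist


def safety_stock(sigma_demand: float, lead_time: float, service_level: float) -> float:
    """Textbook safety stock for normal demand: z * sigma * sqrt(lead_time)."""
    z = NormalDist().inv_cdf(service_level)  # one-sided z-score for the service level
    return z * sigma_demand * lead_time ** 0.5


# Hypothetical inputs: both sources agree on the mean, so only the spread differs.
human_sigma = 40.0      # assumed per-period std. dev. estimated from human data
simulated_sigma = 20.0  # assumed under-dispersed estimate from an LLM simulation

ss_human = safety_stock(human_sigma, lead_time=4, service_level=0.95)
ss_sim = safety_stock(simulated_sigma, lead_time=4, service_level=0.95)

print(f"buffer sized from human data:     {ss_human:.1f} units")
print(f"buffer sized from simulated data: {ss_sim:.1f} units")
```

Because the buffer scales linearly with sigma in this formula, a simulation that reports half the true spread yields half the buffer, even though its mean is exactly right.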

The paper’s methodology is particularly instructive. The authors compare LLM outputs against real human experimental data across multiple operations scenarios. They find that while LLMs can mimic certain heuristics, they lack the noise and contextual sensitivity that characterize actual human decision-making. This mirrors findings in behavioral economics: humans are not just biased but inconsistently biased, and LLMs struggle to reproduce that inconsistency.

Implications for AI Practitioners

For anyone deploying LLMs as synthetic respondents in business simulations or A/B testing, this research serves as a critical warning. Using LLMs as drop-in replacements for human subjects may produce results that are plausible but wrong — especially when the goal is to understand risk, variability, or rare events.

Practitioners should consider three specific limitations:

Distributional blindness: LLMs tend to produce outputs clustered around central tendencies, missing the fat tails that characterize real human behavior in operations.
Context fragility: Model performance degrades when scenarios involve domain-specific operational constraints (e.g., lead times, capacity limits) that are not well represented in training data.
Calibration gaps: Without explicit calibration against real human data, LLM simulations may systematically underestimate behavioral noise, leading to overconfident predictions.
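The limitations above can be checked with a simple distributional comparison. The sketch below uses the empirical 1-Wasserstein distance for two equal-sized samples as one common choice of metric; this is an illustrative pick, not necessarily the procedure the authors follow. Both samples are made-up order quantities, and `wasserstein_1d` is a hypothetical helper name.

```python
import statistics


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-sized 1-D samples.

    For equal sample sizes this is the mean absolute difference between
    the sorted values (i.e., between matching order statistics).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Made-up order quantities: same mean, very different spread.
human = [60, 75, 90, 100, 100, 110, 125, 140]
llm = [95, 98, 99, 100, 100, 101, 102, 105]

print("means:  ", statistics.mean(human), statistics.mean(llm))
print("stdevs: ", round(statistics.stdev(human), 1), round(statistics.stdev(llm), 1))
print("W1 distance:", wasserstein_1d(human, llm))
```

A comparison of means alone would report no difference here. The distance and the standard deviations expose the clustering around the center that the first limitation describes.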

The paper does not argue that LLMs are useless for simulation — they remain valuable for generating hypotheses, piloting experimental designs, or exploring parameter spaces. But the findings underscore that LLMs should complement, not replace, human experiments, particularly when distributional accuracy matters.

Key Takeaways

LLMs can predict the direction of human behavioral biases in operations but fail to replicate the distribution of actual human responses, especially variance and tail events.
Using LLMs as synthetic subjects without calibration against real data risks producing systematically overconfident and under-dispersed simulations.
For operations management applications involving risk, inventory, or capacity decisions, LLM-based simulations should be treated as exploratory, not definitive.
Practitioners should combine LLM simulations with small-scale human validation studies to ensure distributional accuracy before relying on synthetic data for decision-making.
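One way such a validation study might be used is a tail coverage check: take the central 90% interval implied by the LLM responses and count how many human pilot responses fall outside it. This is a minimal sketch under stated assumptions. The function name `tail_miss_rate`, the interval width, and both data sets are hypothetical, and a small pilot gives only a rough signal.

```python
from statistics import quantiles


def tail_miss_rate(simulated, observed, n_cuts=20):
    """Share of observed values outside the simulated central interval.

    With n_cuts=20 the first and last cut points are the 5th and 95th
    percentiles, so a well-calibrated simulation should miss roughly 10%.
    """
    cuts = quantiles(simulated, n=n_cuts, method="inclusive")
    lo, hi = cuts[0], cuts[-1]
    outside = sum(1 for x in observed if x < lo or x > hi)
    return outside / len(observed)


# Hypothetical data: tightly clustered LLM orders vs. a small human pilot.
llm_orders = list(range(90, 111))  # 90, 91, ..., 110
human_pilot = [70, 85, 92, 98, 99, 100, 103, 108, 115, 130]

miss = tail_miss_rate(llm_orders, human_pilot)
print(f"human responses outside the simulated 90% interval: {miss:.0%}")
```

A miss rate well above the nominal 10% suggests the simulation is under-dispersed and should be recalibrated, or treated as exploratory, before its synthetic data feeds a decision.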
