Research · 2026-06-29

Mitigating LLM-based p-Hacking by Preregistering for the Next LLM

Originally published by arXiv cs.AI

arXiv:2606.27687v1. Announce type: cross. Abstract: Large language models (LLMs) are increasingly used to generate, classify, and annotate data whose outputs feed downstream hypothesis tests. However, LLM-based research is easy to p-hack: a researcher can tune the prompts, decoding parameters, or...

The Problem of p-Hacking in LLM Research

A new arXiv preprint (2606.27687v1) tackles a growing methodological concern: how easily researchers can tune large language model outputs until they produce statistically significant results. The paper proposes preregistering the specific LLM and its parameters before conducting analyses, in order to prevent what the authors term "LLM-based p-hacking."

The core issue is straightforward. When researchers use LLMs to generate, classify, or annotate data for hypothesis testing, they have numerous degrees of freedom. They can tweak prompts, adjust decoding parameters like temperature or top-k sampling, change the model version, or alter the instruction formatting. Each minor adjustment can shift the distribution of outputs, and with enough iterations, a researcher can unconsciously (or consciously) find a configuration that yields a p-value below the 0.05 threshold. This is classic p-hacking, but supercharged by the flexibility of LLMs.
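The inflation described above can be illustrated with a small simulation. This is a minimal sketch, not the paper's method: it assumes two groups with the same true label rate (so every significant result is a false positive), and it assumes each prompt or decoding configuration yields an independent set of labels, which overstates independence compared with real re-prompting. The function names, sample sizes, and the two-proportion z-test are illustrative choices.

```python
import math
import random


def two_proportion_p_value(successes_a, successes_b, n):
    """Two-sided p-value for a two-proportion z-test with equal group sizes n."""
    pooled = (successes_a + successes_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation in the labels, so no evidence of a difference
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (successes_a - successes_b) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(num_configs, num_studies=1000, n=200,
                        base_rate=0.3, alpha=0.05, seed=0):
    """Share of null studies that look significant when the best of
    num_configs configurations is the one that gets reported."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_studies):
        p_values = []
        for _ in range(num_configs):
            # Both groups share base_rate, so any difference is noise.
            a = sum(rng.random() < base_rate for _ in range(n))
            b = sum(rng.random() < base_rate for _ in range(n))
            p_values.append(two_proportion_p_value(a, b, n))
        if min(p_values) < alpha:  # report only the most favourable configuration
            hits += 1
    return hits / num_studies


if __name__ == "__main__":
    for k in (1, 5, 10):
        print(k, false_positive_rate(k))
```

Under the independence assumption, the chance of at least one false positive across k configurations is 1 - 0.95^k, roughly 0.40 for k = 10, so the simulated rate should rise from about the nominal 0.05 toward that figure as more configurations are tried.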

Why This Matters

This problem is not hypothetical. LLMs are now embedded in many research pipelines—from social science surveys where models simulate respondents, to biomedical studies where they classify clinical notes, to computational linguistics where they generate training data. If the underlying LLM output is treated as a fixed, objective measurement rather than a stochastic process, the resulting statistical inferences become unreliable.

The paper's proposed solution is to preregister the exact LLM, its version, decoding parameters, and prompt template, mirroring the preregistration movement in psychology and other empirical sciences. The logic is sound: by committing to a specific model configuration before seeing the results, researchers remove the opportunity for post-hoc optimization. This is particularly important because LLM outputs are not deterministic in the way traditional statistical software outputs are. For a borderline effect, changing the random seed alone can be enough to flip a result from significant to non-significant.
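One way to make such a commitment checkable is to freeze the configuration and record a hash of it before any data is generated. The sketch below is illustrative only and is not taken from the paper: the `LLMConfig` fields, the model name `example-model-v1`, and the helper names are hypothetical, and a real preregistration would also be filed with a registry or timestamped somewhere the researcher cannot edit.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class LLMConfig:
    model: str            # hypothetical model identifier, including version
    temperature: float
    top_k: int
    seed: int
    prompt_template: str


def commitment_hash(config):
    """SHA-256 over a canonical JSON encoding of the configuration."""
    canonical = json.dumps(asdict(config), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_preregistration(config, registered_hash):
    """True only if the configuration is identical to the registered one."""
    return commitment_hash(config) == registered_hash


if __name__ == "__main__":
    registered = LLMConfig(
        model="example-model-v1",
        temperature=0.0,
        top_k=1,
        seed=1234,
        prompt_template="Classify the sentiment of: {text}",
    )
    digest = commitment_hash(registered)  # publish this before collecting data
    print(digest)
    print(matches_preregistration(registered, digest))
```

Any later change, even to a single character of the prompt template, produces a different digest, so a reader holding the published hash can check that the analysis used the registered configuration.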

Implications for AI Practitioners

For researchers using LLMs in their workflows, this paper offers both a warning and a practical remedy. The warning is that standard statistical safeguards (like multiple comparison corrections) are insufficient if the data generation process itself is being tuned, because those corrections only account for the tests that are reported, not for the configurations that were tried and discarded. The remedy is to treat the LLM configuration as a fixed part of the experimental design, not as a tunable parameter.

For AI developers building tools for research, this suggests a design principle: make it easy for users to specify and lock model parameters, and hard to change them after seeing results. Providers of hosted models such as Claude and GPT, as well as tooling built around open-source models, should consider offering "research mode" features that log and enforce parameter consistency.
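A "research mode" of this kind could look roughly like the wrapper below. It is a hypothetical sketch rather than any provider's actual API: `ResearchSession`, `ConfigLockedError`, and the `generate_fn` callable are invented for illustration, and `generate_fn` stands in for whatever function really calls the model. Parameters may be set freely until the first generation call, after which they are locked and every call is logged with the parameters used.

```python
class ConfigLockedError(RuntimeError):
    """Raised when a parameter change is attempted after outputs were produced."""


class ResearchSession:
    def __init__(self, generate_fn, **params):
        # generate_fn(prompt, **params) is a caller-supplied stand-in for a model call.
        self._generate_fn = generate_fn
        self._params = dict(params)
        self._locked = False
        self.log = []  # one entry per call: prompt, parameters, output

    def set_param(self, name, value):
        """Change a parameter; allowed only before any output has been seen."""
        if self._locked:
            raise ConfigLockedError(
                f"cannot change {name!r}: configuration locked after first output"
            )
        self._params[name] = value

    def generate(self, prompt):
        """Call the model; the first call locks the configuration."""
        self._locked = True
        output = self._generate_fn(prompt, **self._params)
        self.log.append(
            {"prompt": prompt, "params": dict(self._params), "output": output}
        )
        return output


if __name__ == "__main__":
    def fake_model(prompt, **params):
        # Placeholder that echoes its inputs instead of calling a real model.
        return f"{prompt} | temperature={params['temperature']}"

    session = ResearchSession(fake_model, temperature=0.0)
    print(session.generate("Label this note"))
    try:
        session.set_param("temperature", 0.7)
    except ConfigLockedError as err:
        print("blocked:", err)
```

Locking on the first call rather than on an explicit `lock()` step is a design choice made here so that a user cannot see an output and then forget, or decline, to lock.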

The broader implication is that as LLMs become research instruments, they need the same methodological rigor we apply to physical instruments. You wouldn't recalibrate a spectrometer until you got the result you wanted—the same discipline should apply to prompt engineering.

Key Takeaways

LLM-based p-hacking is a real methodological risk because researchers can tune prompts and parameters to achieve desired statistical significance
Preregistering the exact LLM configuration (model version, parameters, prompt template) before data collection can prevent this form of researcher bias
AI practitioners should treat LLM configurations as fixed experimental parameters, not as variables to optimize post-hoc
Platforms should consider building features that enforce parameter consistency and logging for research use cases
