Research 2026-06-30

When More Sampling Hurts: The Modal Ceiling and Correlation Ceiling of Test-Time Scaling

Originally published by arXiv cs.AI

arXiv:2606.28661v1. Announce Type: cross. Abstract: People overthink; language models over-sample, and the extra effort can talk both into a worse answer. Reasoning systems answer a hard question by sampling it many times (test-time scaling), and the more they draw, the more often a correct answer...

The Paradox of Plenty: When More Sampling Degrades AI Performance

The new arXiv preprint (2606.28661v1) presents a counterintuitive finding that challenges a core assumption in modern AI deployment: that generating more candidate answers through test-time scaling will monotonically improve performance. Instead, the authors identify two distinct failure modes—the modal ceiling and the correlation ceiling—where additional sampling actively degrades results.

What the research reveals

The phenomenon mirrors human overthinking: just as a person can talk themselves out of a correct first instinct, language models can "oversample" themselves into worse answers. The modal ceiling occurs when the most frequent answer among samples becomes less reliable as sample size grows, because the model's own noise patterns create spurious majorities. The correlation ceiling emerges when repeated samples become increasingly correlated, diminishing the marginal benefit of each new draw and eventually introducing systematic bias.
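
One way to see how voting can get worse with more samples is a toy model, which is an assumption made here for illustration and not the paper's setup: each sample is drawn independently, there are only two possible answers, and the correct one is produced with probability 0.45, so the wrong answer is the mode. The function name `majority_vote_accuracy` and the value 0.45 are hypothetical. Under these assumptions the exact majority-vote accuracy is a binomial tail sum.

```python
from math import comb

def majority_vote_accuracy(n, p_correct):
    """Exact chance that a strict majority of n independent samples is correct.

    Toy assumptions: two possible answers, odd n (so no ties), and each
    sample is correct with probability p_correct, independently.
    """
    need = n // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n, k) * p_correct**k * (1 - p_correct) ** (n - k)
        for k in range(need, n + 1)
    )

# Hypothetical setting: the correct answer is slightly less likely than the wrong one.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.45), 3))
```

With a single sample the accuracy is 0.45; as n grows the vote converges on the wrong modal answer and the accuracy falls toward zero. If `p_correct` is above 0.5, the same function shows the familiar improvement with more samples.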

This is not a trivial edge case. According to the paper, for certain reasoning tasks accuracy peaks at a sample count well below what the compute budget allows, and sampling past that peak lowers accuracy.

Why this matters

The finding strikes at the heart of current best practices for deploying reasoning systems. Many production pipelines rely on majority voting or self-consistency decoding, assuming that "more is better." This research suggests that such approaches may be silently degrading performance on precisely the hardest problems where sampling is most aggressively applied.

The implications are threefold. First, compute budgets may be misallocated: teams spending GPU cycles on excessive sampling could achieve better results with fewer, more carefully selected samples. Second, evaluation protocols that report performance at fixed sample sizes may miss this degradation curve entirely, creating misleading benchmarks. Third, the finding suggests that the relationship between sampling and accuracy is non-monotonic—a property that existing scaling laws do not account for.

Implications for AI practitioners

For engineers deploying reasoning systems, the immediate takeaway is to profile the sample-to-accuracy curve for their specific tasks rather than assuming monotonic improvement. This means running controlled experiments with varying sample counts and identifying the point at which additional sampling starts to reduce accuracy.
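
A minimal sketch of such a profiling run follows, assuming the sampled answers for each question are already stored in draw order alongside a gold answer. The helper names (`majority_answer`, `accuracy_curve`, `best_sample_count`) and the three-question data set are made up for illustration and do not come from the paper.

```python
from collections import Counter

def majority_answer(answers):
    """Return the most common answer; ties go to the answer drawn first."""
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer

def accuracy_curve(samples_per_question, gold, sample_counts):
    """Majority-vote accuracy using only the first k samples of each question."""
    curve = {}
    for k in sample_counts:
        correct = sum(
            majority_answer(samples[:k]) == truth
            for samples, truth in zip(samples_per_question, gold)
        )
        curve[k] = correct / len(gold)
    return curve

def best_sample_count(curve):
    """Smallest k that reaches the highest accuracy on the curve."""
    top = max(curve.values())
    return min(k for k, acc in curve.items() if acc == top)

# Hypothetical data: five stored samples per question, in draw order.
samples = [
    ["5", "4", "4", "5", "5"],
    ["7", "7", "7", "7", "7"],
    ["2", "2", "3", "3", "3"],
]
gold = ["4", "7", "2"]

curve = accuracy_curve(samples, gold, sample_counts=[1, 3, 5])
print(curve, best_sample_count(curve))
```

On this toy data the curve rises from one sample to three and then drops at five. A real profile would use many more questions and would probably average over several random orderings of the samples, since a single draw order is noisy.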

The research also points toward more sophisticated aggregation methods. Rather than simple majority voting, practitioners might explore weighted voting schemes that account for sample diversity, or dynamic stopping criteria that halt sampling when the correlation ceiling approaches.
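
As a rough illustration of a stopping rule, the sketch below uses a standard statistical result and not the paper's own definition: if every pair of samples shares the same correlation rho, the mean of n samples carries as much information as n / (1 + (n - 1) * rho) independent samples, which can never exceed 1 / rho. The constant rho, the threshold `min_gain`, and both function names are assumptions introduced here. The sketch covers only the diminishing-returns side of correlated sampling, not any bias it may introduce.

```python
def effective_samples(n, rho):
    """Effective number of independent samples among n equally correlated draws."""
    return n / (1 + (n - 1) * rho)

def stopping_point(rho, min_gain, max_n=1000):
    """Smallest n at which one more draw adds less than min_gain effective samples."""
    for n in range(1, max_n):
        gain = effective_samples(n + 1, rho) - effective_samples(n, rho)
        if gain < min_gain:
            return n
    return max_n  # threshold never reached within the budget

# Hypothetical setting: pairwise correlation 0.1, stop below 0.1 effective samples per draw.
print(effective_samples(10, 0.1))
print(stopping_point(rho=0.1, min_gain=0.1))
```

In practice rho would have to be estimated from the samples themselves, for example from how often pairs of draws agree, and the estimate would itself be uncertain, so a rule like this is a starting point for experiments and not a tuned policy.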

For researchers, the paper opens a new axis of investigation: understanding the conditions under which test-time scaling fails. The modal and correlation ceilings may be symptoms of deeper issues in how language models represent uncertainty and generate diverse outputs.

Key Takeaways

More sampling can actively harm performance on reasoning tasks due to modal and correlation ceilings, challenging the assumption that test-time scaling is monotonic.
Optimal sample counts exist and are task-dependent: practitioners should profile the accuracy curve for their specific use case rather than defaulting to maximum sampling.
Current evaluation practices may be misleading if they report performance at a single sample size without checking for degradation at higher counts.
Better aggregation methods are needed: simple majority voting may be suboptimal, and dynamic stopping and diversity-aware weighting warrant investigation.

