Research | 2026-06-26

Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

arXiv:2606.26101v1 (announce type: cross). Abstract: Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a contamination-aware,...

What Happened

A new research paper introduces Know2Guess, a benchmark for evaluating whether a large language model genuinely knows an answer or is merely guessing, while explicitly controlling for data contamination, prompt sensitivity, and refusal patterns. The benchmark divides evaluation into multiple "zones" that separate supported knowledge from unsupported speculation, addressing a long-standing blind spot in LLM assessment.

The core innovation is contamination-awareness: many existing benchmarks inadvertently reward models for having memorized test-like training data rather than for reasoning or for factual knowledge that generalizes. Know2Guess constructs test items where the model must distinguish between information it was trained on (and thus could recall) and information it must infer or extrapolate. With multi-zone splits, the benchmark can attribute performance to genuine knowledge rather than to lucky guesses or memorized patterns.

Why It Matters

Current evaluation practices conflate several distinct failure modes. A model that answers correctly might be: (1) genuinely knowledgeable, (2) guessing based on statistical patterns, (3) repeating contaminated training data, or (4) responding to prompt artifacts. Conversely, a wrong answer could stem from knowledge gaps, prompt misinterpretation, or overly cautious refusal behavior. Know2Guess aims to disentangle these.

This matters because deployment decisions hinge on knowing what a model actually knows. In high-stakes domains like medicine, law, or finance, a model that guesses confidently but incorrectly is far more dangerous than one that explicitly refuses. Current benchmarks cannot reliably distinguish these cases. As a hypothetical illustration, a model that scores 90% on a multiple-choice test might be guessing on 30% of the answers it gets right, and the score alone would not reveal it.
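To put the 90% illustration in perspective, the classical correction for guessing gives a quick baseline. The sketch below is not taken from the Know2Guess paper. It assumes that every item the model does not know is answered by uniform random guessing among the options, and the function names `known_fraction` and `lucky_share` are hypothetical.

```python
def known_fraction(accuracy: float, num_options: int) -> float:
    """Estimate the fraction of items the model actually knows.

    Assumption (not from the paper): each unknown item is answered by
    uniform random guessing, so accuracy = known + (1 - known) / num_options.
    """
    chance = 1.0 / num_options
    if not chance <= accuracy <= 1.0:
        raise ValueError("accuracy must lie between chance level and 1.0")
    return (accuracy - chance) / (1.0 - chance)


def lucky_share(accuracy: float, num_options: int) -> float:
    """Share of the correct answers that came from lucky guesses."""
    known = known_fraction(accuracy, num_options)
    return (accuracy - known) / accuracy


if __name__ == "__main__":
    # 90% accuracy on four-option questions, as in the illustration above.
    print(f"known fraction: {known_fraction(0.9, 4):.3f}")
    print(f"lucky share of correct answers: {lucky_share(0.9, 4):.3f}")
```

Under this assumption, a 90% score on four-option questions implies that only about 3.7% of the correct answers were lucky. A share as high as 30% would require guesses that succeed far more often than chance, for example through statistical patterns or leaked data, and a simple chance correction cannot detect that.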

The contamination-aware design is particularly timely. As training datasets grow larger and more opaque, the line between legitimate knowledge and data leakage blurs. Know2Guess provides a methodology to detect when a model's performance is inflated by exposure to test-like examples during training.

Implications for AI Practitioners

For evaluation teams: This benchmark offers a more granular diagnostic tool. Instead of a single accuracy score, practitioners can now measure knowledge confidence: the probability that a correct answer reflects genuine understanding versus statistical luck. This enables better risk assessment for deployment.

For model developers: The multi-zone approach reveals where models rely on memorization versus reasoning. If a model performs well on contaminated zones but poorly on uncontaminated ones, it signals overfitting to training data rather than robust capability. This can guide data curation and training strategies.

For safety researchers: Know2Guess's refusal-aware design helps distinguish between appropriate uncertainty and excessive caution. Models that refuse too often (or too rarely) can be calibrated more precisely when the benchmark isolates refusal behavior from knowledge gaps.

A caveat: The benchmark's complexity means it may not replace simpler evaluations for rapid iteration. Practitioners should view it as a diagnostic layer, applied after basic accuracy checks, to validate that high scores reflect genuine capability.
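The per-zone comparison described for model developers and safety researchers can be sketched as a small aggregation. This is a minimal illustration, not the paper's scoring code: the `Record` type, the zone labels `"contaminated"` and `"clean"`, and the function `summarize_by_zone` are all hypothetical names, and refusals are assumed to be already labeled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    zone: str      # hypothetical zone label, e.g. "contaminated" or "clean"
    correct: bool  # whether the answer was right (ignored if refused)
    refused: bool  # whether the model declined to answer


def summarize_by_zone(records):
    """Return refusal rate and accuracy-when-answered for each zone."""
    counts = defaultdict(lambda: {"total": 0, "answered": 0, "correct": 0})
    for r in records:
        c = counts[r.zone]
        c["total"] += 1
        if not r.refused:
            c["answered"] += 1
            if r.correct:
                c["correct"] += 1

    summary = {}
    for zone, c in counts.items():
        refused = c["total"] - c["answered"]
        summary[zone] = {
            "refusal_rate": refused / c["total"],
            # None when the model refused every item in the zone.
            "accuracy_when_answered": (
                c["correct"] / c["answered"] if c["answered"] else None
            ),
        }
    return summary


if __name__ == "__main__":
    demo = [
        Record("contaminated", correct=True, refused=False),
        Record("contaminated", correct=True, refused=False),
        Record("clean", correct=False, refused=False),
        Record("clean", correct=False, refused=True),
    ]
    for zone, stats in summarize_by_zone(demo).items():
        print(zone, stats)
```

Reporting refusal rate separately from accuracy on answered items keeps a cautious model from looking the same as an ignorant one. A large accuracy gap between the two zones is the memorization signal the article describes.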

Key Takeaways

Know2Guess introduces contamination-aware, multi-zone evaluation that separates genuine knowledge from guessing, memorization, and refusal artifacts.
This addresses a critical gap: current benchmarks cannot reliably distinguish whether a correct answer reflects understanding or statistical luck.
For practitioners, the benchmark enables more precise risk assessment, especially in high-stakes deployments where false confidence is dangerous.
The approach is best used as a diagnostic layer after basic accuracy checks, not as a replacement for simpler evaluations during rapid iteration.

Read Original Article on Arxiv CS.AI

Tags: arxiv, papers, benchmark