Research · 2026-06-26

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

arXiv:2606.27226v1. Abstract: Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to...

A Simpler Yardstick: Why Binary Questions Could Fix LLM Evaluation

The research community has long grappled with a fundamental problem: how do you reliably measure whether an LLM output is "good"? The paper "Ask, Don't Judge" proposes a deceptively simple solution—replace holistic scoring with a series of binary, factual questions. Instead of asking a judge model to rate an answer on a scale of 1-5, the method decomposes evaluation into discrete yes/no queries about specific properties of the output.

This approach directly attacks three known weaknesses in current evaluation pipelines. First, human evaluation is prohibitively expensive for iterative development. Second, lexical metrics like BLEU or ROUGE fail to capture semantic quality in open-ended tasks. Third, and most critically, LLM-as-judge methods produce opaque scores that are difficult to interpret, debug, or trust. A score of 3.7 out of 5 tells a practitioner little about why the output fell short. A binary question such as "Does the response contain a factual error?" points to one specific, checkable property, so a failed check says what to fix.
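To make the decomposition concrete, here is a minimal sketch of a binary-question checklist. It is not the paper's implementation: the BinaryQuestion class, the evaluate and failed_checks helpers, the two example questions, and the rule-based toy_judge (a stand-in for a real LLM judge call) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class BinaryQuestion:
    key: str              # short name used in the report
    text: str             # the yes/no question shown to the judge
    passing_answer: bool  # which answer (yes=True / no=False) counts as a pass


# Hypothetical judge signature: (question text, output) -> True for "yes".
Judge = Callable[[str, str], bool]


def evaluate(output: str, questions: List[BinaryQuestion], judge: Judge) -> Dict[str, bool]:
    """Return a per-question pass/fail report instead of one scalar score."""
    return {q.key: judge(q.text, output) == q.passing_answer for q in questions}


def failed_checks(results: Dict[str, bool]) -> List[str]:
    """List the keys of the questions that did not pass."""
    return [key for key, passed in results.items() if not passed]


def toy_judge(question: str, output: str) -> bool:
    """Rule-based stand-in for an LLM judge (assumption, for illustration only)."""
    if "cite a source" in question:
        return "source:" in output.lower()
    if "longer than 50 words" in question:
        return len(output.split()) > 50
    raise ValueError(f"toy_judge has no rule for: {question}")


QUESTIONS = [
    BinaryQuestion("cites_source", "Does the response cite a source?", True),
    BinaryQuestion("concise", "Is the response longer than 50 words?", False),
]

report = evaluate("Paris is the capital of France.", QUESTIONS, toy_judge)
print(report)
print(failed_checks(report))
```

The report names the check that failed (here, the missing source) rather than folding everything into a single number, which is the interpretability benefit the article describes. In a real pipeline the judge would be a model call, and its answers would need their own validation.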

Why This Matters for AI Practitioners

The practical implications are significant. Binary evaluation creates a natural feedback loop for self-improvement. If an LLM can answer binary questions about its own outputs, it can flag specific failure modes, such as hallucination, missing steps, or logical inconsistency, and attempt to correct them without human intervention. The loop is only as reliable as those answers, but each one can be inspected individually. This moves beyond the single scalar signal of an RLHF reward model toward interpretable, component-based quality control.
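The feedback loop described above might be sketched as follows. This is an illustrative assumption, not the procedure from the paper: the self_correct function, the two toy checks, and the toy_revise function (which stands in for asking a model to revise its draft) are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[str], bool]               # True means the draft passes
Reviser = Callable[[str, List[str]], str]   # (draft, failed check names) -> new draft


def current_failures(draft: str, checks: Dict[str, Check]) -> List[str]:
    """Names of the binary checks the draft currently fails."""
    return [name for name, check in checks.items() if not check(draft)]


def self_correct(
    draft: str,
    checks: Dict[str, Check],
    revise: Reviser,
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Revise until every check passes or the round budget is spent.

    Returns the final draft and the list of checks still failing.
    """
    for _ in range(max_rounds):
        failures = current_failures(draft, checks)
        if not failures:
            return draft, []
        draft = revise(draft, failures)
    return draft, current_failures(draft, checks)


# Toy checks and reviser (assumptions standing in for model calls).
CHECKS: Dict[str, Check] = {
    "has_final_answer": lambda text: text.startswith("Answer:"),
    "has_units": lambda text: "km" in text,
}


def toy_revise(draft: str, failures: List[str]) -> str:
    if "has_final_answer" in failures:
        draft = "Answer: " + draft
    if "has_units" in failures:
        draft = draft + " km"
    return draft


print(self_correct("42", CHECKS, toy_revise))
```

The round budget matters in practice: a reviser that cannot fix a given failure would otherwise loop forever, so the sketch returns whatever failures remain for a human or a downstream system to handle.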

For teams building production systems, this method offers three concrete advantages. First, reproducibility: yes/no answers tend to be more stable across runs than scalar judgments, which can shift with prompt phrasing, although they are strictly deterministic only when the judge itself is decoded deterministically. Second, debuggability: when an evaluation fails, the specific question that triggered the failure pinpoints the issue. Third, cost efficiency: a yes/no answer needs far fewer generated tokens than a nuanced textual critique, which makes continuous monitoring more viable, though the total cost still grows with the number of questions asked per output.

The approach also addresses a growing concern about evaluation reliability. Recent work has shown that LLM judges are inconsistent: reordering the options or making minor changes to the prompt can flip scores. Binary questions reduce this fragility by constraining the judge to a narrow, verifiable task. The trade-off is that designing good binary questions requires upfront effort and domain knowledge, in exchange for evaluations whose results can be traced to individual checks.
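One way a team might check this fragility empirically is to ask the same binary question several times (for example, with paraphrased prompts) and measure how often the verdicts agree. The helper below is a sketch under that assumption; the function name and the sample verdict lists are made up for illustration and are not results from the paper.

```python
from collections import Counter
from typing import List, Tuple


def majority_and_agreement(verdicts: List[bool]) -> Tuple[bool, float]:
    """Return the majority verdict and the fraction of runs that agree with it.

    On an exact tie, the verdict that appears first in the list wins,
    because Counter.most_common keeps first-seen order for equal counts.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict, count / len(verdicts)


# Hypothetical verdicts from four paraphrases of the same yes/no question.
print(majority_and_agreement([True, True, False, True]))
# Hypothetical verdicts where every run agrees.
print(majority_and_agreement([False, False, False]))
```

An agreement rate well below 1.0 for a given question suggests the question is ambiguous or the judge is sensitive to phrasing, which is a signal to rewrite that question rather than to trust its majority answer.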

Key Takeaways

Replace opaque scores with diagnostic questions: Binary evaluation transforms LLM assessment from a black-box rating into a transparent checklist of specific quality attributes.
Enables automated self-correction: By identifying exact failure points through binary queries, LLMs can iteratively improve their own outputs without human feedback.
Improves evaluation reliability: Binary classification reduces the fragility and inconsistency seen in holistic LLM judges, producing more reproducible results.
Practical for production systems: The approach offers cost-effective, interpretable, and debuggable evaluation suitable for continuous monitoring and quality assurance pipelines.

Read Original Article on Arxiv CS.AI
