ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence
arXiv:2606.26437v1 (cross-listed). Abstract: Existing metrics for factuality and faithfulness evaluate whether an answer is supported or contradicted by its grounding documents, but they fail to capture when both supporting and contradicting evidence coexist. We introduce ConflictScore, a...
The Blind Spot in Factuality Metrics
A new paper on arXiv introduces ConflictScore, a metric aimed at a gap in how language model outputs are evaluated: existing metrics do not capture cases where the grounding documents both support and contradict a claim. Current factuality and faithfulness metrics ask whether an answer is supported or contradicted, but real-world retrieval often returns documents that disagree with one another. ConflictScore is designed to measure how models handle these mixed-evidence cases.
What ConflictScore Actually Measures
The metric quantifies a model’s sensitivity to conflicting evidence by comparing its confidence or output distribution on two inputs: purely supportive documents, and a mix of supportive and contradictory documents. A high ConflictScore indicates that the model adjusts its behavior when contradictions are present; a low score suggests that it ignores or fails to integrate the conflicting signals. This goes beyond checking whether an answer is right or wrong and asks about epistemic awareness: whether the model “knows” when it should be uncertain.
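The comparison described above can be illustrated with a small sketch. This is not the paper’s formula, which this summary does not reproduce; it is a toy sensitivity measure under stated assumptions. The function name `conflict_sensitivity`, the use of a single scalar confidence per claim, and the sample numbers are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def conflict_sensitivity(
    conf_supportive: Sequence[float],
    conf_mixed: Sequence[float],
) -> float:
    """Average confidence drop when contradicting documents are added.

    Hypothetical illustration only; NOT the ConflictScore definition.
    Each position holds the model's confidence in the same claim under
    the two evidence conditions.
    """
    if len(conf_supportive) != len(conf_mixed):
        raise ValueError("both conditions need one confidence per claim")
    if not conf_supportive:
        raise ValueError("need at least one claim")
    # A positive drop means the model became less confident once
    # contradicting evidence was mixed in.
    drops = [s - m for s, m in zip(conf_supportive, conf_mixed)]
    return mean(drops)


# Made-up confidences for three claims (not real results).
supportive_only = [0.9, 0.8, 0.95]
with_conflict = [0.6, 0.8, 0.55]
score = conflict_sensitivity(supportive_only, with_conflict)
print(f"mean confidence drop: {score:.2f}")
```

Under this toy definition, a value near zero would mean the model’s confidence barely moves when contradictions appear, which is the insensitive behavior the article warns about.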
Why This Matters Now
The timing is significant. As LLMs are deployed in high-stakes domains like legal research, medical diagnosis, and financial analysis, the ability to handle contradictory sources becomes essential. A legal assistant that cites both a precedent and the later ruling that overturned it, without flagging the conflict, is not just unhelpful; it’s dangerous. Current RAG (Retrieval-Augmented Generation) pipelines often assume that retrieved documents agree with one another, an assumption that frequently fails in practice. ConflictScore exposes this fragility.
For AI practitioners, this has immediate implications:
- Evaluation pipelines need upgrading. Standard metrics like F1 or BERTScore compare an output against a reference answer, so they won’t catch cases where the sources themselves disagree. Teams building retrieval systems should incorporate ConflictScore or similar conflict-aware evaluations into their testing regimes.
- Prompt engineering must account for ambiguity. Simply instructing models to “use the provided context” is insufficient. Explicit prompts asking “Does the evidence support, contradict, or conflict on this point?” may be necessary.
- Confidence calibration becomes actionable. ConflictScore provides a concrete way to check whether a model’s expressed uncertainty reflects the ambiguity in its source material, rather than only the model’s internal confidence.
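As a rough sketch of what a conflict-aware check could look like in a pipeline, the snippet below builds an explicit three-way prompt and collapses per-document stance labels into a single verdict. Everything here is an assumption made for illustration: the names `Verdict`, `aggregate_stances`, and `build_conflict_prompt`, the label strings `"support"`, `"contradict"`, and `"neutral"`, and the idea that some upstream step (a model call or an annotator) has already labeled each document.

```python
from enum import Enum
from typing import Iterable, Sequence


class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    CONFLICTING = "conflicting"
    UNSUPPORTED = "unsupported"


def aggregate_stances(stances: Iterable[str]) -> Verdict:
    """Collapse per-document stance labels into one claim-level verdict.

    Hypothetical labels: "support", "contradict", "neutral".
    """
    seen = set(stances)
    has_support = "support" in seen
    has_contradict = "contradict" in seen
    if has_support and has_contradict:
        # The case a binary supported/contradicted metric cannot express.
        return Verdict.CONFLICTING
    if has_support:
        return Verdict.SUPPORTED
    if has_contradict:
        return Verdict.CONTRADICTED
    return Verdict.UNSUPPORTED


def build_conflict_prompt(claim: str, documents: Sequence[str]) -> str:
    """Build a prompt that asks for a three-way judgment explicitly."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, 1))
    return (
        f"Claim: {claim}\n\n"
        f"Documents:\n{numbered}\n\n"
        "Does the evidence support, contradict, or conflict on this point? "
        "If the documents disagree, say so and cite the document numbers."
    )


print(aggregate_stances(["support", "neutral", "contradict"]).value)
print(build_conflict_prompt("Case A is good law.", ["Doc one.", "Doc two."]))
```

Keeping the aggregation separate from the prompt means the same verdict logic can be applied to human annotations when building a test set.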
The Deeper Challenge
ConflictScore highlights a broader issue: current LLMs are trained to produce coherent, confident outputs, but the real world is messy and contradictory. Teaching models to express uncertainty or flag conflicts is a fundamentally different capability from factual recall. This metric is a step toward making models more honest about what they don’t know—or what the evidence itself cannot definitively resolve.
Key Takeaways
- ConflictScore addresses a blind spot in existing factuality metrics by measuring how models handle genuinely conflicting evidence, not just false or unsupported claims.
- Practitioners should integrate conflict-aware evaluation into RAG pipelines, especially for high-stakes applications where contradictory sources are common.
- The metric provides a practical way to check whether a model’s uncertainty tracks the ambiguity in real-world evidence, not just its internal confidence scores.
- This work signals a shift from evaluating factual correctness toward evaluating epistemic honesty—whether models can appropriately express uncertainty when evidence is mixed.