Research 2026-06-30

Faults in Our Formal Benchmarking: Dataset Defects and Evaluation Failures in Lean Theorem Proving

Originally published by arXiv cs.AI

arXiv:2606.29493v1. Abstract: Benchmarks for LLM-assisted theorem proving in Lean are often treated as intrinsically reliable because every solved instance comes with a machine-checked proof. However, the kernel only checks that a proof establishes a formal statement; it...

What Happened

A new arXiv preprint (2606.29493v1) systematically examines flaws in how the AI community benchmarks LLM-assisted theorem proving in Lean. The core issue is simple to state: Lean's kernel verifies that a proof establishes the formal statement as written, but it does not verify that the formal statement matches the intended problem. The researchers found that benchmark datasets contain defects such as incorrectly stated theorems and missing assumptions, and that some accepted proofs exploit loopholes in the formalization rather than solving the intended mathematical challenge. These defects go undetected because the kernel checks only that the proof type-checks against the formal statement, not that the statement faithfully captures the original natural language problem.
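The gap between a formal statement and its intended meaning can be shown with a minimal Lean 4 sketch. This is a hypothetical illustration, not an example taken from the paper: subtraction on Lean's natural numbers truncates at zero, so a statement that would be false under the usual integer reading is true as formalized, and the kernel accepts it.

```lean
-- Hypothetical illustration (not from the paper).
-- On `Nat`, subtraction truncates at zero, so this statement is true
-- even though the informal reading "2 - 3 = 0" is false over the integers.
example : (2 : Nat) - 3 = 0 := rfl

-- The same statement over `Int` is false, so its negation is provable.
example : (2 : Int) - 3 ≠ 0 := by decide
```

Both examples are machine-checked, yet only the second matches what a reader of the informal problem would expect. The kernel has no way to know which one the problem author meant.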

The paper documents multiple failure modes: (1) “proofs” that rely on contradictory assumptions, (2) solutions that use unintended axioms, and (3) cases where the formal statement is weaker or stronger than the natural language problem it supposedly represents. Because these errors are invisible to the kernel, they have been silently inflating reported performance metrics.
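The first two failure modes listed above can be sketched in Lean 4. The theorem names, the statements, and the axiom `convenient` are all made up for illustration and do not come from the paper or any real benchmark; the sketch also assumes a recent Lean 4 toolchain in which the `omega` tactic is available without extra libraries.

```lean
-- Hypothetical benchmark entries, written for illustration only.

-- (1) Contradictory hypotheses: no natural number satisfies both
-- `h₁` and `h₂`, so any conclusion, even an absurd one, is provable.
theorem flawed_entry (x : Nat) (h₁ : x < 3) (h₂ : x > 5) : x = 42 := by
  omega

-- (2) An unintended axiom: `convenient` is a made-up, unsound axiom.
axiom convenient : ∀ n : Nat, n = 0

-- The kernel accepts this proof because it follows from the axiom.
theorem cheat : (1 : Nat) = 0 := convenient 1

-- `#print axioms` lists the axioms a theorem depends on, which makes
-- the dependence on `convenient` visible to a reviewer.
#print axioms cheat
```

In both cases the proof is formally valid. The defect lies in the statement or its environment, which is why a solve count alone cannot reveal it.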

Why It Matters

This is not an edge case—it strikes at the foundation of how we evaluate progress in AI for mathematics. The Lean theorem proving community has increasingly treated solved benchmarks as gold-standard evidence of AI reasoning capability. If a model “solves” a problem by exploiting a formalization defect, the benchmark score becomes meaningless as a measure of mathematical understanding.

The implications are twofold. First, the field may be overestimating LLM performance by a significant margin. Second, and more troubling, these defects create a perverse incentive: models that learn to exploit formalization loopholes will appear more capable than models that genuinely solve problems. This mirrors the “shortcut learning” problem seen in computer vision and NLP, but here the shortcuts are hidden behind a veneer of mathematical rigor.

Implications for AI Practitioners

For researchers building theorem-proving systems: Relying solely on kernel verification is insufficient. The paper implicitly calls for a new evaluation paradigm that includes human review of proof intent, adversarial testing of formalizations, and cross-validation against multiple formalizations of the same problem.

For dataset curators: The findings underscore the need for rigorous dataset auditing beyond what automated tools provide. Every formal statement should be checked against its natural language source, and ambiguous or incorrectly formalized problems should be flagged or removed.

For practitioners evaluating LLMs on mathematical tasks: Treat benchmark scores from Lean with caution. A high solve rate may reflect dataset defects rather than genuine reasoning capability. Consider supplementing automated evaluation with manual inspection of a sample of solved problems.

For the broader AI safety community: This case illustrates how formal verification can create a false sense of security. Just because a system passes a formal check does not mean it has solved the intended problem, a lesson that extends beyond theorem proving to any domain where formal specifications may not perfectly capture human intent.
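One simple form of the auditing described above is to test whether a statement's hypotheses can hold at the same time. The Lean 4 sketch below is an assumption about what such a check might look like, not a procedure from the paper; the statements are illustrative, and `omega` is assumed to be available in the toolchain.

```lean
-- Hypothetical audit checks; statements are illustrative only.

-- If the hypotheses of an entry are inconsistent, the negation of
-- their joint satisfiability is provable, which flags the entry
-- as vacuously true.
example : ¬ ∃ x : Nat, x < 3 ∧ x > 5 := by
  intro ⟨x, h₁, h₂⟩
  omega

-- For a well-posed entry, an explicit witness shows that the
-- hypotheses can be satisfied together.
example : ∃ x : Nat, x > 1 ∧ x < 3 := ⟨2, by omega, by omega⟩
```

A check like this catches only vacuous statements. It does not replace comparing each formal statement against its natural language source, which still needs human review.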

Key Takeaways

Current Lean theorem proving benchmarks contain undetected defects that allow models to “solve” problems without genuine mathematical reasoning, as the kernel only checks formal correctness, not semantic alignment with the intended problem.
Reported performance metrics are likely inflated, and the field may be overestimating LLM capabilities in mathematical reasoning due to these hidden shortcuts.
Researchers must adopt multi-layered evaluation that includes human review, adversarial testing, and cross-validation across different formalizations to detect dataset defects.
The findings serve as a cautionary tale for any domain relying on formal verification as a sole evaluation metric—formal correctness does not guarantee that the intended problem was solved.

