Discovering New Theorems via LLMs with In-Context Proof Learning in Lean
arXiv:2509.14274v4. Abstract: Large Language Models (LLMs) have demonstrated significant promise in formal theorem proving. In this study, we investigate the ability of LLMs to discover novel theorems and produce verified proofs. We propose a pipeline called...
What Happened
A new arXiv preprint (2509.14274) presents a pipeline that uses large language models not only to prove existing theorems but also to discover novel theorems and generate verified proofs within the Lean formal proof assistant. The core innovation is "In-Context Proof Learning": an LLM is given examples of theorem statements and their formal proofs, then prompted to propose new, nontrivial theorems and prove them, with Lean checking each proof. The pipeline iteratively refines candidate theorems, discarding those that fail verification or are trivial, and retaining only novel, provable results.
This moves beyond the well-trodden path of using LLMs as proof assistants for known conjectures. Instead, the model takes on the role of an autonomous mathematician: it generates hypotheses, attempts formal proofs, and filters for originality. The Lean environment provides a rigorous check, so any theorem the pipeline retains comes with a machine-checked proof of the statement as formalized.
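For readers unfamiliar with Lean, the sketch below shows what a statement-and-proof pair of the kind the checker accepts looks like. The theorem names are hypothetical and the statements are deliberately elementary; they are illustrations only, not results from the preprint.

```lean
-- A statement together with a complete proof term: Lean accepts it.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- The same kind of claim proved in tactic style.
theorem mul_one_example (n : Nat) : n * 1 = n := by
  exact Nat.mul_one n
```

If either proof were wrong, Lean would report an error at that declaration, which is the signal a pipeline can use to discard a candidate.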
Why It Matters
This research addresses a fundamental bottleneck in automated mathematics: the gap between generating plausible mathematical statements and verifying their correctness. Prior work has shown LLMs can suggest conjectures, but without formal verification, those suggestions remain speculative. By embedding the LLM in a proof-checking loop, the pipeline ensures that every retained output is a statement with a proof that Lean has accepted.
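One way such a proof check could be wired up is sketched below. It assumes a `lean` executable on the PATH (a project that depends on a library would more likely run the file through `lake env lean`); the function name `check_lean_source` and the file name are hypothetical, and this is not the preprint's implementation. Note that Lean accepts a proof containing `sorry` with only a warning, so the sketch rejects such candidates before calling the checker.

```python
import subprocess
import tempfile
from pathlib import Path


def check_lean_source(source: str, cmd=("lean",), timeout: float = 60.0) -> bool:
    """Return True only if `cmd` accepts the source and it contains no `sorry`.

    `cmd` is the checker command as a sequence of arguments (assumption:
    the checker takes the file path as its last argument and exits with a
    nonzero status on errors, as the `lean` executable does).
    """
    # Lean treats `sorry` as a warning, not an error, so filter it explicitly.
    # This substring test is conservative: it also rejects harmless mentions.
    if "sorry" in source:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.lean"
        path.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [*cmd, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired):
            # Missing checker or a proof search that never finishes: not verified.
            return False
    return result.returncode == 0
```

Treating timeouts and a missing executable as failures keeps the check one-sided: a candidate is kept only when the checker positively accepts it.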
For the broader AI community, this is a significant step toward autonomous scientific discovery. The approach is not limited to mathematics: similar pipelines could be applied to program synthesis (generating verified code), formal verification of hardware, or even legal reasoning, where correctness must be certified. The key insight is that LLMs can be creative generators, while formal systems serve as strict judges of whatever they propose.
For AI practitioners, this work demonstrates a practical architecture for combining generative models with symbolic verification. The pipeline's iterative refinement (generate, attempt proof, filter, retry) is a template for any domain where correctness is paramount. It also highlights the importance of high-quality in-context examples: the LLM's ability to propose novel theorems depends critically on the diversity and depth of the proof examples it sees.
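The generate, attempt proof, filter, retry cycle can be sketched as a small loop. Everything here is an assumption made for illustration: the `Theorem` record, the `discovery_loop` function, and the four callables (`propose`, `prove`, `verify`, `is_novel`) are hypothetical names, and the toy stand-ins at the bottom replace the LLM and Lean with fixed rules so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Theorem:
    statement: str
    proof: str


def discovery_loop(
    propose: Callable[[List[Theorem]], Iterable[str]],
    prove: Callable[[str, List[Theorem]], Optional[str]],
    verify: Callable[[str, str], bool],
    is_novel: Callable[[str, List[Theorem]], bool],
    seed: List[Theorem],
    rounds: int = 3,
    max_attempts: int = 2,
) -> List[Theorem]:
    """Return newly verified theorems; verified results join the context."""
    context = list(seed)  # copy so the caller's seed list is not mutated
    discovered: List[Theorem] = []
    for _ in range(rounds):
        for statement in propose(context):          # generate
            if not is_novel(statement, context):    # filter rediscoveries
                continue
            for _ in range(max_attempts):           # retry failed proofs
                proof = prove(statement, context)   # attempt proof
                if proof is not None and verify(statement, proof):
                    theorem = Theorem(statement, proof)
                    context.append(theorem)
                    discovered.append(theorem)
                    break
    return discovered


# Toy stand-ins (hypothetical): a real pipeline would call an LLM and Lean.
def toy_propose(context):
    return ["a + 0 = a", "a + b = b + a", "0 = 1"]


def toy_prove(statement, context):
    return "by simp"


def toy_verify(statement, proof):
    return statement != "0 = 1"  # pretend the checker rejects the false claim


def toy_is_novel(statement, context):
    return all(statement != t.statement for t in context)


seed = [Theorem("a + 0 = a", "by simp")]
found = discovery_loop(toy_propose, toy_prove, toy_verify, toy_is_novel, seed)
print([t.statement for t in found])  # ['a + b = b + a']
```

Passing the four steps in as callables keeps the control flow separate from any particular model or proof checker, so either can be swapped without touching the loop.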
Implications for AI Practitioners
First, this approach reduces the need for massive fine-tuning. By using in-context learning rather than retraining, practitioners can adapt the pipeline to new formal systems or domains with minimal computational cost. Second, the verification loop acts as a natural guardrail against hallucination: the LLM can propose anything, but only statements with verified proofs survive. This is a powerful pattern for deploying LLMs in high-stakes settings.
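To make the in-context idea concrete, here is a minimal sketch of assembling a prompt from previously verified statement-and-proof pairs instead of retraining the model. The function name `build_prompt`, the prompt wording, and the `max_examples` cutoff are all assumptions for illustration; the preprint's actual prompts are not reproduced here.

```python
from typing import List, Tuple


def build_prompt(examples: List[Tuple[str, str]], max_examples: int = 8) -> str:
    """Build a few-shot prompt from (statement, proof) pairs.

    Only the most recent `max_examples` pairs are kept so the prompt
    stays within a bounded size as the pool of verified theorems grows.
    """
    recent = examples[-max_examples:] if max_examples > 0 else []
    parts = ["The following Lean theorems have been verified:"]
    for statement, proof in recent:
        parts.append(f"theorem {statement} :=\n  {proof}")
    parts.append(
        "Propose one new, nontrivial theorem in the same style and prove it in Lean."
    )
    return "\n\n".join(parts)


verified = [
    ("add_zero_ex (a : Nat) : a + 0 = a", "Nat.add_zero a"),
    ("add_comm_ex (a b : Nat) : a + b = b + a", "Nat.add_comm a b"),
    ("mul_one_ex (n : Nat) : n * 1 = n", "Nat.mul_one n"),
]
prompt = build_prompt(verified, max_examples=2)
```

Adapting this to a different formal system would mean changing the example pairs and the instruction text, with no change to model weights.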
However, the pipeline’s success hinges on the quality of the formal environment. Lean’s rich library of existing theorems and proof tactics is a prerequisite; applying this to a less mature formal system would be far harder. Practitioners should also note that the novelty filter requires a database of known theorems to avoid rediscovery.
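A very simple form of such a novelty filter is sketched below, assuming known theorems are available as plain statement strings. The `NoveltyFilter` class and the `normalize` helper are hypothetical. The comparison is purely textual after whitespace normalization, so it catches verbatim rediscoveries but not logically equivalent restatements (for example, renamed variables); a real system would need something stronger.

```python
import re
from typing import Iterable, Set


def normalize(statement: str) -> str:
    """Collapse runs of whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", statement).strip()


class NoveltyFilter:
    """Tracks known statements and flags candidates that repeat them."""

    def __init__(self, known: Iterable[str]):
        self._seen: Set[str] = {normalize(s) for s in known}

    def is_novel(self, statement: str) -> bool:
        return normalize(statement) not in self._seen

    def add(self, statement: str) -> None:
        # Record an accepted theorem so later rounds do not rediscover it.
        self._seen.add(normalize(statement))


known_db = ["(a b : Nat) : a + b = b + a"]
novelty = NoveltyFilter(known_db)
```

Calling `add` after each accepted theorem keeps the database in step with the pipeline's own discoveries as well as the pre-existing library.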
Key Takeaways
- LLMs can now generate novel, formally verified theorems, not just prove existing ones, using a pipeline that combines generative creativity with rigorous verification in Lean.
- The In-Context Proof Learning approach is resource-efficient: it avoids fine-tuning by relying on carefully curated examples, making it adaptable to other formal systems.
- The verification loop is a powerful antidote to hallucination: only outputs that pass formal checking survive, offering a template for deploying LLMs in correctness-critical domains.
- Success depends on the maturity of the formal environment: the approach is most viable where rich theorem libraries and proof tactics already exist, limiting its immediate applicability to nascent formal systems.