Aria: An Agent For Retrieval and Iterative Auto-Formalization via Dependency Graph
arXiv:2510.04520v2. Abstract: Accurate auto-formalization of theorem statements is essential for advancing automated discovery and verification of research-level mathematics, yet remains a major bottleneck for LLMs due to hallucinations, semantic mismatches, and their...
The Formalization Bottleneck
The arXiv preprint on Aria tackles one of the most stubborn obstacles in AI-driven mathematics: reliably translating natural-language theorem statements into formal, machine-verifiable code. While large language models have shown surprising fluency in generating mathematical prose, they remain unreliable when asked to produce precise formal specifications, a task where a single misplaced quantifier or type mismatch can make a statement fail to type-check or silently change what it asserts.
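To make the quantifier point concrete, here is a small Lean 4 sketch. It is an illustration only and is not taken from the Aria paper; the theorem names are hypothetical, and the proofs assume only lemmas from Lean's core library. Both statements use the same symbols, but swapping the two quantifiers turns a true claim into a false one.

```lean
-- Hypothetical illustration; not from the Aria paper.

-- "Every natural number has a larger one": true.
theorem every_nat_has_larger : ∀ x : Nat, ∃ y : Nat, x < y :=
  fun x => ⟨x + 1, Nat.lt_succ_self x⟩

-- Swapping the quantifiers claims one y lies above every x: false,
-- so only the negation is provable (take x := y to get y < y).
theorem no_nat_above_all : ¬ ∃ y : Nat, ∀ x : Nat, x < y :=
  fun ⟨y, h⟩ => Nat.lt_irrefl y (h y)
```

A formalizer that emits the second quantifier order for the first informal sentence produces code that parses and type-checks as a statement, yet says something different from what the mathematician wrote.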
Aria’s core innovation is its use of a dependency graph to guide an iterative auto-formalization process. Instead of attempting a one-shot translation, the system first decomposes the theorem into its constituent dependencies—definitions, lemmas, and prior results—then formalizes each component in a structured, sequential manner. This approach directly addresses the hallucination problem: by explicitly tracking what each formalized component depends on, Aria can detect when a generated formalization introduces spurious symbols or mismatches with existing library definitions.
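The idea of formalizing components only after the things they depend on can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Aria's implementation: the graph contents, the function names, and the stub_formalize stand-in for a model call are all hypothetical. Only graphlib.TopologicalSorter is a real standard-library API (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each concept maps to the concepts it relies on.
DEPS = {
    "theorem": {"continuous_fn", "compact_set"},
    "continuous_fn": {"metric_space"},
    "compact_set": {"metric_space"},
    "metric_space": set(),
}


def formalization_order(graph):
    """Return concepts so that every dependency precedes its dependents."""
    return list(TopologicalSorter(graph).static_order())


def formalize_all(graph, formalize):
    """Formalize each concept after its dependencies.

    `formalize(name, context)` receives the already-formalized
    dependencies of `name` as a dict, so it never has to guess them.
    """
    done = {}
    for name in formalization_order(graph):
        context = {dep: done[dep] for dep in graph[name]}
        done[name] = formalize(name, context)
    return done


def stub_formalize(name, context):
    # Stand-in for an LLM call (assumption): just records what it was given.
    return f"def {name} uses {sorted(context)}"


if __name__ == "__main__":
    for name, text in formalize_all(DEPS, stub_formalize).items():
        print(f"{name}: {text}")
```

Passing the formalized dependencies in as explicit context is the design choice that matters here: a component can only refer to names that already exist, which is what makes spurious symbols detectable.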
Why This Matters
The significance here extends beyond theorem proving. Auto-formalization is the critical bridge between the informal mathematical literature, where most mathematical knowledge resides, and formal verification systems like Lean, Coq, or Isabelle. Without reliable translation, automated mathematical discovery stalls: an LLM might propose a novel theorem, but if it cannot formalize it correctly, no verifier can check it, and no mathematician can trust it.
For AI practitioners, Aria’s dependency-graph approach offers a broader architectural lesson. The technique mirrors how human mathematicians work: they don’t formalize a complex statement in one pass, but rather build up from known definitions, checking each step against existing knowledge. This suggests that many AI tasks requiring high precision—code generation, legal reasoning, scientific protocol design—could benefit from similar decomposition strategies that externalize and track dependencies.
Implications for Practitioners
First, this work validates the growing consensus that single-pass generation is insufficient for tasks requiring formal correctness. Practitioners building systems for code generation or formal verification should consider multi-step, dependency-aware pipelines rather than relying on prompt engineering alone.
Second, the approach avoids manual annotation. The dependency graph is constructed automatically from the theorem statement and existing formal libraries, which is what would allow the method to scale to large mathematical corpora. Teams working on automated reasoning should explore similar graph-based decomposition for their own formalization pipelines.
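As a rough sketch of automatic graph construction, the following Python expands a statement into sub-concepts until each branch reaches a name found in an existing library. Everything here is an assumption made for illustration: the decompose callback stands in for a model call, TOY_LIBRARY stands in for a formal library index, and the concept names are invented. It is not Aria's algorithm.

```python
def build_dependency_graph(root, decompose, library):
    """Expand `root` until every branch ends in a library concept.

    Returns (graph, ungrounded): `graph` maps each concept to the set of
    concepts it depends on; `ungrounded` holds leaves that are neither in
    the library nor decomposable, i.e. candidates for hallucinated terms.
    """
    graph = {}
    ungrounded = set()
    stack = [root]
    while stack:
        concept = stack.pop()
        if concept in graph:
            continue  # already expanded; shared dependencies are visited once
        if concept in library:
            graph[concept] = set()  # grounded: no further expansion needed
            continue
        children = set(decompose(concept))
        graph[concept] = children
        if not children:
            ungrounded.add(concept)
        stack.extend(c for c in children if c not in graph)
    return graph, ungrounded


# Toy data (hypothetical): a tiny "library" and a lookup-table decomposer.
TOY_LIBRARY = {"metric_space", "continuous"}
TOY_DECOMPOSITION = {
    "theorem": ["continuous", "compact", "uniform_bound"],
    "compact": ["metric_space"],
}


def toy_decompose(concept):
    return TOY_DECOMPOSITION.get(concept, [])


if __name__ == "__main__":
    graph, ungrounded = build_dependency_graph("theorem", toy_decompose, TOY_LIBRARY)
    print(graph)
    print("needs attention:", ungrounded)
```

In the toy run, uniform_bound is flagged because it is neither in the library nor broken down any further, which is the kind of gap a pipeline would want to surface before attempting to formalize the theorem.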
Third, the iterative refinement loop, where the system checks its own output against the dependency graph and retries when mismatches occur, provides a concrete template for building self-correcting AI systems. The check is grounded in an external structure rather than in LLM self-critique alone, which often fails to catch subtle formal errors.
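A check-and-retry loop of this kind can be sketched as follows. This is a simplified illustration, not Aria's checker: the symbol check is a plain identifier scan against a set of known names, whereas a real system would rely on the proof assistant's own type checker. The names undefined_symbols, refine, and scripted_generator are hypothetical, as are the sample statements.

```python
import re


def undefined_symbols(statement, known):
    """Return identifiers in `statement` that are not in `known`."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", statement))
    return tokens - known


def refine(generate, known, max_attempts=3):
    """Call `generate(feedback)` until its output uses only known symbols.

    Returns (candidate, attempts) on success, or (None, max_attempts)
    if no attempt passed the check.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        missing = undefined_symbols(candidate, known)
        if not missing:
            return candidate, attempt
        # Feed the concrete mismatch back instead of asking the model to self-critique.
        feedback = f"Unknown symbols: {sorted(missing)}"
    return None, max_attempts


def scripted_generator(outputs):
    """Stand-in for a model (assumption): replays canned outputs in order."""
    remaining = iter(outputs)
    seen_feedback = []

    def generate(feedback):
        seen_feedback.append(feedback)
        return next(remaining)

    return generate, seen_feedback


KNOWN = {"IsCompact", "Continuous", "K", "f", "and"}

if __name__ == "__main__":
    gen, log = scripted_generator(
        ["IsCompact K and Contunuous f", "IsCompact K and Continuous f"]
    )
    print(refine(gen, KNOWN))
    print(log)
```

The point of the design is that the retry signal comes from a deterministic comparison against known names, so a misspelled or invented symbol is reported by name rather than left to the model to notice.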
Key Takeaways
- Aria addresses a critical bottleneck: Reliable auto-formalization of mathematical theorems, which existing LLMs handle poorly due to hallucinations and semantic mismatches.
- Dependency graphs enable structured decomposition: By breaking formalization into tracked dependencies, Aria reduces errors and improves consistency compared to one-shot generation.
- The approach is architecturally instructive: Practitioners in code generation, legal AI, and scientific automation should consider similar dependency-aware, iterative pipelines.
- Self-correction is grounded in graph checking rather than LLM self-critique: The system catches errors by comparing generated formalizations against known dependencies, not by relying on the model’s own judgment.