ProofFlow: A Dependency Graph Approach to Faithful Proof Autoformalization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Autoformalization, Large Language Models, Dependency Graph, Lean (Formal Language), Structural Fidelity, Semantic Faithfulness
Abstract:

Proof autoformalization, the task of translating natural language theorems and proofs into machine-verifiable code, is a critical step for integrating large language models into rigorous mathematical workflows. Current approaches focus on producing executable code, but they frequently fail to preserve the semantic meaning and logical structure of the original human-written argument. To address this, we introduce ProofFlow, a novel pipeline that treats structural fidelity as a primary objective. ProofFlow first constructs a directed acyclic graph (DAG) to map the logical dependencies between proof steps. Then, it employs a novel lemma-based approach to systematically formalize each step as an intermediate lemma, preserving the logical structure of the original argument. To facilitate evaluation, we present a new benchmark of 184 undergraduate-level problems, manually annotated with step-by-step solutions and logical dependency graphs, and introduce ProofScore, a new composite metric to evaluate syntactic correctness, semantic faithfulness, and structural fidelity. Experimental results show our pipeline sets a new state-of-the-art for autoformalization, achieving a ProofScore of 0.545, substantially exceeding baselines like full-proof formalization (0.279), which processes the entire proof at once, and step-proof formalization (0.046), which handles each step independently. Our pipeline, benchmark, and score metric are open-sourced to encourage further progress at https://anonymous.4open.science/r/ProofFlow-351E.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ProofFlow introduces a pipeline that constructs directed acyclic graphs to map logical dependencies between proof steps, then formalizes each step as an intermediate lemma to preserve structural fidelity. The taxonomy places this work in the 'Dependency Graph-Based Formalization' leaf under 'Structure-Aware Autoformalization', which currently contains only this paper. This points to a relatively sparse research direction within the broader autoformalization landscape and suggests that explicit graph-based structural modeling is not yet widely explored in the literature.

The taxonomy reveals that ProofFlow's parent branch, 'Structure-Aware Autoformalization', sits alongside 'End-to-End Neural Autoformalization' and 'Controlled Natural Language Formalization' as major methodological divisions. Neighboring leaves include 'Incremental Step-by-Step Formalization' (which processes proofs sequentially with verification feedback) and 'Full-Proof Autoformalization' (which translates complete proofs without decomposition). ProofFlow diverges from these by explicitly modeling dependency graphs before formalization, occupying a distinct methodological niche that bridges structural analysis and systematic translation.

Among the 24 candidates examined through semantic search, none were found to clearly refute any of ProofFlow's three contributions. For the ProofFlow pipeline, 10 candidates were examined with zero refutable overlaps; for the ProofScore metric, 6 candidates with zero refutations; and for the ProofFlowBench benchmark, 8 candidates with zero refutations. Within this limited search scope, the combination of dependency graph construction, lemma-based formalization, and the specific evaluation framework therefore appears relatively novel, though the analysis does not cover the entire field.

Based on the top-24 semantic matches and the taxonomy structure, ProofFlow appears to occupy a sparsely populated methodological space. The absence of sibling papers in its taxonomy leaf and the lack of clear prior work overlap in the examined candidates suggest meaningful novelty, though this assessment is constrained by the limited search scope. A more comprehensive literature review covering additional venues and earlier foundational work in proof structure analysis would strengthen confidence in this preliminary assessment.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 24
Refutable papers: 0

Research Landscape Overview

Core task: Translating natural language mathematical proofs into formal verification code. The field has evolved into several distinct branches that address different facets of this challenge. Autoformalization Methods and Systems explores techniques for converting informal mathematics into machine-checkable code, ranging from structure-aware approaches like dependency graph-based methods to controlled natural language interfaces (Naproche[46], Controlled Natural Language[38]). Formal Proof Synthesis and Search focuses on generating and discovering proofs within formal systems, often leveraging neural methods (Deep Learning Theorem Proving[9]) or structured search strategies (Draft Sketch Prove[14]). Benchmarks and Datasets provides the empirical foundation, with resources like ProofNet[5] and Lean Workbook[7] enabling systematic evaluation. Verification and Alignment Evaluation addresses the correctness and fidelity of translations (FormalAlign[15]), while Domain-Specific and Applied Formalization targets particular mathematical areas such as Euclidean geometry (Euclidean Geometry Proofs[12]). Surveys and Overviews (Autoformalization Survey[10]) synthesizes progress, and Foundations and Theoretical Perspectives examines the underlying principles of proof and computation (Proof and Computation[28]).

Recent work has intensified around structure-aware autoformalization, where methods exploit the logical dependencies and hierarchical organization of proofs rather than treating them as flat text. ProofFlow[0] exemplifies this trend by using dependency graphs to guide formalization, situating itself within a small cluster of works that parse proof structure explicitly. This contrasts with earlier efforts like Autoformalization LLMs[2], which rely more heavily on end-to-end neural translation, and with step-by-step approaches such as StepProof[4] that incrementally build formal statements.

A key trade-off emerges between preserving the natural proof's modularity (which enables easier debugging and human readability) and achieving high automation with minimal user intervention. ProofFlow[0] leans toward the former, emphasizing how dependency-aware decomposition can improve both correctness and interpretability, while neighboring systems like Informal to Formal[3] explore hybrid strategies that balance structure and flexibility. Open questions remain about scalability to complex, multi-layered arguments and the extent to which graph-based representations generalize across diverse mathematical domains.

Claimed Contributions

ProofFlow pipeline for structure-preserving proof autoformalization

The authors propose a three-stage pipeline that constructs a directed acyclic graph (DAG) to map logical dependencies between proof steps, then employs a lemma-based approach to systematically formalize each step as an intermediate lemma, preserving the logical structure of the original natural language proof.

10 retrieved papers

ProofScore metric for comprehensive autoformalization evaluation

The authors develop a unified scoring method that explicitly measures three key properties of autoformalized proofs: syntactic correctness (no compilation errors), semantic faithfulness (preserving mathematical meaning), and structural fidelity (preserving the proof's dependency graph).

6 retrieved papers

ProofFlowBench benchmark dataset with annotated dependency graphs

The authors introduce a curated benchmark dataset containing 184 undergraduate-level mathematics theorems and proofs from six key areas, each manually annotated with proof steps divided into logical components and their respective dependency graphs for evaluating structural fidelity.

8 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ProofFlow pipeline for structure-preserving proof autoformalization

The authors propose a three-stage pipeline that constructs a directed acyclic graph (DAG) to map logical dependencies between proof steps, then employs a lemma-based approach to systematically formalize each step as an intermediate lemma, preserving the logical structure of the original natural language proof.
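
To make the lemma-based idea concrete, the following Lean 4 sketch shows one way a three-step informal argument could be laid out so that each step becomes a named intermediate fact whose justification mentions only the facts it depends on. The theorem, the step names (`step1`, `step2`), and the use of `have` blocks are illustrative assumptions; the report does not show the paper's actual output format.

```lean
-- Hypothetical informal proof:
--   Step 1: from h₁ : p → q and hp : p, conclude q.
--   Step 2: from h₂ : q → r and Step 1, conclude r.
--   Step 3: from hp and Step 2, conclude p ∧ r.
-- Dependency edges: hp → step1, h₁ → step1, h₂ → step2,
--                   step1 → step2, hp → goal, step2 → goal.
theorem structured_example (p q r : Prop)
    (h₁ : p → q) (h₂ : q → r) (hp : p) : p ∧ r := by
  -- Step 1 uses only h₁ and hp.
  have step1 : q := h₁ hp
  -- Step 2 uses only h₂ and step1.
  have step2 : r := h₂ step1
  -- Final step uses only hp and step2.
  exact ⟨hp, step2⟩
```

In this layout the dependency structure of the informal proof can be read off directly from which names each step's justification references, which is the property a structure-preserving formalization is meant to keep.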

Contribution

ProofScore metric for comprehensive autoformalization evaluation

The authors develop a unified scoring method that explicitly measures three key properties of autoformalized proofs: syntactic correctness (no compilation errors), semantic faithfulness (preserving mathematical meaning), and structural fidelity (preserving the proof's dependency graph).
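
As an illustration of what "preserving the proof's dependency graph" could mean operationally, the sketch below compares the edge set of a reference dependency graph with the edge set recovered from a formalized proof, using an F1 overlap. This is a hypothetical proxy chosen for illustration only; the report does not give the ProofScore formula, and the function name `edge_f1` and the step identifiers are assumptions.

```python
from typing import Iterable, Set, Tuple

# An edge (a, b) means "step b depends on step a".
Edge = Tuple[str, str]


def edge_f1(reference: Iterable[Edge], predicted: Iterable[Edge]) -> float:
    """F1 overlap between two dependency-edge sets (illustrative proxy only)."""
    ref: Set[Edge] = set(reference)
    pred: Set[Edge] = set(predicted)
    if not ref and not pred:
        return 1.0  # two empty graphs agree trivially
    true_pos = len(ref & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotated graph versus the graph read off a formalization.
reference_edges = [("s1", "s2"), ("s1", "s3"), ("s2", "s3")]
predicted_edges = [("s1", "s2"), ("s2", "s3"), ("s2", "s4")]
score = edge_f1(reference_edges, predicted_edges)
```

Here two of the three predicted edges match two of the three reference edges, so precision and recall are both 2/3. A direction-sensitive comparison like this penalizes a formalization that proves the right statements but wires them together differently from the original argument.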

Contribution

ProofFlowBench benchmark dataset with annotated dependency graphs

The authors introduce a curated benchmark dataset containing 184 undergraduate-level mathematics theorems and proofs from six key areas, each manually annotated with proof steps divided into logical components and their respective dependency graphs for evaluating structural fidelity.
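
A benchmark entry of this kind can be pictured as a theorem plus a list of steps, each carrying the identifiers of the steps it depends on. The sketch below is a minimal representation under that assumption, with a check that the annotations form a valid DAG; the class and field names (`ProofStep`, `AnnotatedProof`, `depends_on`) are hypothetical and not taken from ProofFlowBench. It relies on `graphlib` from the Python standard library (Python 3.9+).

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


@dataclass
class ProofStep:
    step_id: str
    text: str
    depends_on: List[str] = field(default_factory=list)


@dataclass
class AnnotatedProof:
    theorem: str
    steps: List[ProofStep]

    def dependency_graph(self) -> Dict[str, List[str]]:
        # Map each step to the steps it depends on (its predecessors).
        return {s.step_id: list(s.depends_on) for s in self.steps}

    def formalization_order(self) -> List[str]:
        """Return an order in which every step follows its dependencies."""
        graph = self.dependency_graph()
        unknown = {d for deps in graph.values() for d in deps} - set(graph)
        if unknown:
            raise ValueError(f"unknown step ids: {sorted(unknown)}")
        try:
            return list(TopologicalSorter(graph).static_order())
        except CycleError as exc:
            raise ValueError("dependency annotations contain a cycle") from exc


# Hypothetical entry; the statement and step texts are made up for illustration.
example = AnnotatedProof(
    theorem="If n is even, then n^2 is even.",
    steps=[
        ProofStep("s1", "Write n = 2k for some integer k."),
        ProofStep("s2", "Then n^2 = 4k^2.", ["s1"]),
        ProofStep("s3", "So n^2 = 2(2k^2), which is even.", ["s2"]),
    ],
)
order = example.formalization_order()
```

Validating acyclicity at load time is a cheap guard: a cyclic annotation would make it impossible to formalize steps in any dependency-respecting order.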
