FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Formal Theorem Proving, Benchmark, Mathematical Reasoning, Formalization, Algebra, Lean
Abstract:

Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE, a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models' natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.
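For readers unfamiliar with the formal setting, the following is a hypothetical illustration (not drawn from FATE) of the kind of statement the benchmark targets: an undergraduate-level algebra fact stated and closed in Lean 4 against Mathlib. The lemma name `isCyclic_of_prime_card` is Mathlib's at the time of writing.

```lean
-- Illustrative only: an undergraduate-level abstract algebra fact in
-- Lean 4 / Mathlib. FATE problems are analogous in form but range up
-- to and beyond PhD qualifying-exam difficulty.
import Mathlib

-- Every group of prime order is cyclic.
example (G : Type*) [Group G] [Fintype G] (p : ℕ) [Fact p.Prime]
    (h : Fintype.card G = p) : IsCyclic G :=
  isCyclic_of_prime_card h
```

FATE-H and FATE-X problems have the same shape (a formal statement to be proved), but their content routinely exceeds what Mathlib currently covers.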

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FATE, a benchmark series for formal theorem proving in abstract and commutative algebra, spanning undergraduate exercises to problems exceeding PhD qualifying exams. According to the taxonomy, it resides in the 'Advanced Research-Level Benchmarks' leaf under 'Benchmark Development and Evaluation'. Notably, this leaf contains only the original paper itself—no sibling papers are listed—indicating that research-level formal algebra benchmarks at this difficulty tier represent a sparse, emerging direction within the field.

The taxonomy reveals that most benchmark activity concentrates in the sibling leaf 'Undergraduate and Contest Mathematics Benchmarks', which includes two papers on autoformalization and competition problems. Neighboring branches address 'Proof Automation and Assistance Systems' (LLM-based agents, proof repair) and 'Formalization of Algebraic Theories and Structures' (universal algebra, Lie algebras, linear algebra). FATE bridges these areas by providing evaluation targets for automation systems while complementing formalization efforts, yet its focus on PhD-level difficulty distinguishes it from existing undergraduate or contest-oriented datasets.

Among thirty candidates examined, none clearly refute any of the three contributions. The FATE benchmark series itself (ten candidates, zero refutable) appears novel in targeting formal algebra at research depth. The two-stage evaluation methodology separating natural-language reasoning from formalization (ten candidates, zero refutable) and the baseline performance analysis (ten candidates, zero refutable) likewise show no substantial prior overlap within the limited search scope. This suggests that combining research-level algebraic problems with systematic evaluation of LLM provers represents a relatively unexplored intersection.

Based on the top-thirty semantic matches and taxonomy structure, the work occupies a sparsely populated niche. The absence of sibling papers and zero refutable candidates across all contributions indicate that formal benchmarks exceeding PhD-level algebra difficulty are rare. However, the limited search scope means the analysis cannot rule out relevant work outside the candidate pool or in adjacent communities (e.g., interactive theorem proving workshops, domain-specific formalization projects).

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: formal theorem proving in advanced algebra. The field encompasses efforts to mechanize and verify sophisticated algebraic reasoning within proof assistants. The taxonomy reveals several major branches:

- Benchmark Development and Evaluation focuses on creating datasets and testbeds to measure progress, ranging from educational exercises to research-level challenges;
- Proof Automation and Assistance Systems explores tools and techniques that help users construct and repair proofs more efficiently;
- Formalization of Algebraic Theories and Structures documents the painstaking work of encoding abstract algebra (groups, rings, fields, Lie algebras) into machine-checkable libraries;
- Algebraic Methods in Automated Reasoning investigates how algebraic structures themselves (e.g., Kleene algebras, regular algebras) can drive decision procedures;
- Verification of Algebraic Computations and Circuits targets the correctness of hardware and symbolic algorithms;
- Advanced Topics in Formal Mathematics addresses cutting-edge formalizations such as higher category theory and homotopy type theory;
- Educational and Pedagogical Aspects considers how formal methods can support teaching and learning.

Representative works include the Odd Order Theorem[6] formalization, Formal Linear Algebra[5] libraries, and newer benchmarks like ProofNet[1]. Within this landscape, a particularly active line of work concerns the design of challenging benchmarks that push the boundaries of what automated and interactive provers can handle. Research-level benchmarks aim to capture the difficulty of graduate-level and contemporary research problems, in contrast to more pedagogical or competition-oriented datasets like Olympiad Inequalities[8]. FATE[0] sits squarely in this advanced benchmark category, providing a collection of formal algebra problems intended to stress-test state-of-the-art systems.
Compared to ProofNet[1], which spans multiple undergraduate domains, FATE[0] narrows its scope to deep algebraic content, and unlike Automated Proof Repair[2] or LeanAgent[10], which emphasize interactive assistance and agent-based solving, FATE[0] primarily offers a curated evaluation suite. This focus on rigorous, research-grade test cases complements ongoing formalization efforts in areas such as Lie Algebras Lean[12] and Cholesky Factorization[3], helping the community gauge whether new proof automation techniques can scale to the frontier of algebraic mathematics.

Claimed Contributions

FATE benchmark series for formal algebra theorem proving (10 retrieved papers)

The authors introduce FATE, a progressive benchmark series spanning undergraduate to post-PhD qualifying exam difficulty in formal algebra. FATE-H contains 100 graduate-level problems, FATE-X contains 100 PhD-level problems, and they extend FATE-M from 141 to 150 problems, forming a complete difficulty progression.

Two-stage evaluation methodology for natural and formal reasoning (10 retrieved papers)

The authors develop a two-stage evaluation approach that separately assesses models' natural language mathematical reasoning and their formalization ability. This methodology reveals a significant gap between intermediate natural language accuracy and final formal proof generation, with systematic classification of formalization errors.

Baseline performance evaluation and comparative analysis (10 retrieved papers)

The authors establish comprehensive baseline performance for state-of-the-art LLMs and specialized theorem provers on the FATE benchmarks, revealing stark performance gaps compared to contest mathematics and identifying formalization as the primary bottleneck rather than mathematical reasoning ability.
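The pass@64 figures quoted in the abstract are sample-based success rates. The paper does not specify its estimator, but such numbers are commonly computed with the unbiased pass@k estimator (1 - C(n-c, k)/C(n, k) for n attempts with c successes); a minimal sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n attempts (c of them successful)
    succeeds. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a budget of 64 attempts per problem:
print(pass_at_k(64, 0, 64))  # no attempt succeeded -> 0.0
print(pass_at_k(64, 1, 64))  # any success counts    -> 1.0
```

The benchmark-level score (e.g., 3% on FATE-H) is then the mean of this per-problem estimate over the 100 problems.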

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: FATE benchmark series for formal algebra theorem proving

The authors introduce FATE, a progressive benchmark series spanning undergraduate to post-PhD qualifying exam difficulty in formal algebra. FATE-H contains 100 graduate-level problems, FATE-X contains 100 PhD-level problems, and they extend FATE-M from 141 to 150 problems, forming a complete difficulty progression.

Contribution 2: Two-stage evaluation methodology for natural and formal reasoning

The authors develop a two-stage evaluation approach that separately assesses models' natural language mathematical reasoning and their formalization ability. This methodology reveals a significant gap between intermediate natural language accuracy and final formal proof generation, with systematic classification of formalization errors.

Contribution 3: Baseline performance evaluation and comparative analysis

The authors establish comprehensive baseline performance for state-of-the-art LLMs and specialized theorem provers on the FATE benchmarks, revealing stark performance gaps compared to contest mathematics and identifying formalization as the primary bottleneck rather than mathematical reasoning ability.