FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Formal Theorem Proving, Benchmark, Mathematical Reasoning, Formalization, Algebra, Lean
Abstract:

Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE, a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models' natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.
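For readers unfamiliar with the formal setting, the following is a hypothetical illustration (not drawn from FATE) of the kind of statement the benchmark targets: an undergraduate-level algebra fact stated and closed in Lean 4 against Mathlib. The lemma name `isCyclic_of_prime_card` is Mathlib's at the time of writing.

```lean
-- Illustrative only: an undergraduate-level abstract algebra fact in
-- Lean 4 / Mathlib. FATE problems are analogous in form but range up
-- to and beyond PhD qualifying-exam difficulty.
import Mathlib

-- Every group of prime order is cyclic.
example (G : Type*) [Group G] [Fintype G] (p : ℕ) [Fact p.Prime]
    (h : Fintype.card G = p) : IsCyclic G :=
  isCyclic_of_prime_card h
```

FATE-H and FATE-X problems have the same shape (a formal statement to be proved), but their content routinely exceeds what Mathlib currently covers.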

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FATE, a benchmark series for formal theorem proving in abstract and commutative algebra, spanning undergraduate exercises to problems exceeding PhD qualifying exams. According to the taxonomy, it resides in the 'Advanced Research-Level Benchmarks' leaf under 'Benchmark Development and Evaluation'. Notably, this leaf contains only the original paper itself—no sibling papers are listed—indicating that research-level formal algebra benchmarks at this difficulty tier represent a sparse, emerging direction within the field.

The taxonomy reveals that most benchmark activity concentrates in the sibling leaf 'Undergraduate and Contest Mathematics Benchmarks', which includes two papers on autoformalization and competition problems. Neighboring branches address 'Proof Automation and Assistance Systems' (LLM-based agents, proof repair) and 'Formalization of Algebraic Theories and Structures' (universal algebra, Lie algebras, linear algebra). FATE bridges these areas by providing evaluation targets for automation systems while complementing formalization efforts, yet its focus on PhD-level difficulty distinguishes it from existing undergraduate or contest-oriented datasets.

Among thirty candidates examined, none clearly refute any of the three contributions. The FATE benchmark series itself (ten candidates, zero refutable) appears novel in targeting formal algebra at research depth. The two-stage evaluation methodology separating natural-language reasoning from formalization (ten candidates, zero refutable) and the baseline performance analysis (ten candidates, zero refutable) likewise show no substantial prior overlap within the limited search scope. This suggests that combining research-level algebraic problems with systematic evaluation of LLM provers represents a relatively unexplored intersection.

Based on the top-thirty semantic matches and taxonomy structure, the work occupies a sparsely populated niche. The absence of sibling papers and zero refutable candidates across all contributions indicate that formal benchmarks exceeding PhD-level algebra difficulty are rare. However, the limited search scope means the analysis cannot rule out relevant work outside the candidate pool or in adjacent communities (e.g., interactive theorem proving workshops, domain-specific formalization projects).

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: formal theorem proving in advanced algebra. The field encompasses efforts to mechanize and verify sophisticated algebraic reasoning within proof assistants. The taxonomy reveals several major branches:

- Benchmark Development and Evaluation focuses on creating datasets and testbeds to measure progress, ranging from educational exercises to research-level challenges;
- Proof Automation and Assistance Systems explores tools and techniques that help users construct and repair proofs more efficiently;
- Formalization of Algebraic Theories and Structures documents the painstaking work of encoding abstract algebra (groups, rings, fields, Lie algebras) into machine-checkable libraries;
- Algebraic Methods in Automated Reasoning investigates how algebraic structures themselves (e.g., Kleene algebras, regular algebras) can drive decision procedures;
- Verification of Algebraic Computations and Circuits targets the correctness of hardware and symbolic algorithms;
- Advanced Topics in Formal Mathematics addresses cutting-edge formalizations such as higher category theory and homotopy type theory;
- Educational and Pedagogical Aspects considers how formal methods can support teaching and learning.

Representative works include the Odd Order Theorem[6] formalization, Formal Linear Algebra[5] libraries, and newer benchmarks like ProofNet[1]. Within this landscape, a particularly active line of work concerns the design of challenging benchmarks that push the boundaries of what automated and interactive provers can handle. Research-level benchmarks aim to capture the difficulty of graduate-level and contemporary research problems, in contrast to more pedagogical or competition-oriented datasets like Olympiad Inequalities[8]. FATE[0] sits squarely in this advanced benchmark category, providing a collection of formal algebra problems intended to stress-test state-of-the-art systems.
Compared to ProofNet[1], which spans multiple undergraduate domains, FATE[0] narrows its scope to deep algebraic content, and unlike Automated Proof Repair[2] or LeanAgent[10], which emphasize interactive assistance and agent-based solving, FATE[0] primarily offers a curated evaluation suite. This focus on rigorous, research-grade test cases complements ongoing formalization efforts in areas such as Lie Algebras Lean[12] and Cholesky Factorization[3], helping the community gauge whether new proof automation techniques can scale to the frontier of algebraic mathematics.

Claimed Contributions

FATE benchmark series for formal algebra theorem proving (10 retrieved papers)

The authors introduce FATE, a progressive benchmark series spanning undergraduate to post-PhD qualifying exam difficulty in formal algebra. FATE-H contains 100 graduate-level problems, FATE-X contains 100 PhD-level problems, and they extend FATE-M from 141 to 150 problems, forming a complete difficulty progression.

Two-stage evaluation methodology for natural and formal reasoning (10 retrieved papers)

The authors develop a two-stage evaluation approach that separately assesses models' natural language mathematical reasoning and their formalization ability. This methodology reveals a significant gap between intermediate natural language accuracy and final formal proof generation, with systematic classification of formalization errors.

Baseline performance evaluation and comparative analysis (10 retrieved papers)

The authors establish comprehensive baseline performance for state-of-the-art LLMs and specialized theorem provers on the FATE benchmarks, revealing stark performance gaps compared to contest mathematics and identifying formalization as the primary bottleneck rather than mathematical reasoning ability.
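The pass@64 figures quoted in the abstract are sample-based success rates. The paper does not specify its estimator, but such numbers are commonly computed with the unbiased pass@k estimator (1 - C(n-c, k)/C(n, k) for n attempts with c successes); a minimal sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n attempts (c of them successful)
    succeeds. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a budget of 64 attempts per problem:
print(pass_at_k(64, 0, 64))  # no attempt succeeded -> 0.0
print(pass_at_k(64, 1, 64))  # any success counts    -> 1.0
```

The benchmark-level score (e.g., 3% on FATE-H) is then the mean of this per-problem estimate over the 100 problems.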

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: FATE benchmark series for formal algebra theorem proving

The authors introduce FATE, a progressive benchmark series spanning undergraduate to post-PhD qualifying exam difficulty in formal algebra. FATE-H contains 100 graduate-level problems, FATE-X contains 100 PhD-level problems, and they extend FATE-M from 141 to 150 problems, forming a complete difficulty progression.

Contribution 2: Two-stage evaluation methodology for natural and formal reasoning

The authors develop a two-stage evaluation approach that separately assesses models' natural language mathematical reasoning and their formalization ability. This methodology reveals a significant gap between intermediate natural language accuracy and final formal proof generation, with systematic classification of formalization errors.

Contribution 3: Baseline performance evaluation and comparative analysis

The authors establish comprehensive baseline performance for state-of-the-art LLMs and specialized theorem provers on the FATE benchmarks, revealing stark performance gaps compared to contest mathematics and identifying formalization as the primary bottleneck rather than mathematical reasoning ability.