The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: AI, artificial intelligence, reasoning, LLM, math, benchmark, dataset, proof, GPT, machine learning
Abstract:

In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and addressing key open questions in the field of automated proof generation. Specifically, it remains unknown (1) how large the gap is between natural-language and formal proof generation, (2) how final-answer accuracy relates to full proof correctness, and (3) how best-of-n selection strategies affect proof quality. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first large dataset of LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we address the open questions outlined above and provide new insights into LLMs' strengths and limitations in mathematical reasoning. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that matches Gemini-2.5-Pro and performs close to the best model, GPT-5, at evaluating proof correctness.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the Open Proof Corpus (OPC), a dataset of over 5,000 human-evaluated LLM-generated proofs targeting prestigious competition problems (USAMO, IMO). It resides in the Competition and Olympiad Problem Datasets leaf, which contains four papers total. This is a moderately populated research direction within the broader Benchmark Development branch, suggesting active but not overcrowded interest in competition-level evaluation resources. The work explicitly addresses gaps in understanding natural versus formal proof generation and the relationship between final-answer accuracy and full proof correctness.

The taxonomy reveals neighboring leaves focused on Formal Proof Benchmarks (five papers requiring proof-assistant verification) and Natural Language Proof Benchmarks (four papers emphasizing informal mathematical language). The OPC bridges these domains by collecting natural-language proofs for competition problems, a niche distinct from university-level coursework benchmarks (four papers) and specialized domain datasets (four papers in trigonometry, inequalities, etc.). Its scope note emphasizes high-difficulty contest settings, differentiating it from general undergraduate or research-level problem collections that lack the competitive structure.

Among the 30 candidates examined (ten per claimed contribution), the OPC dataset contribution shows overlap with two prior works. The pipeline for generating and evaluating natural-language proofs and the fine-tuned 8B-parameter judging model each yielded zero refutable candidates among their ten, so both appear more distinctive within this limited search scope. The dataset contribution's partial overlap likely reflects existing competition-problem collections, though the scale of human evaluation and the focus on LLM-generated solutions may differentiate it.

Based on top-30 semantic matches, the work occupies a recognizable but not densely populated niche. The dataset's novelty hinges on its scale of human evaluation and integration of multiple competition sources, while the methodological contributions (pipeline, judging model) show fewer overlaps in the examined literature. This analysis covers a targeted sample rather than exhaustive field coverage, so additional relevant work may exist beyond the search scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Large-scale evaluation of LLM-generated mathematical proofs. The field has organized itself around four main branches that collectively address how to rigorously assess machine-generated formal and informal mathematical reasoning. Benchmark Development and Dataset Construction focuses on creating diverse problem collections—ranging from competition and Olympiad datasets like Putnam AXIOM[23] and RIMO[50] to specialized domains such as inequality proofs and combinatorial identities—that provide standardized testbeds for model capabilities. Evaluation Methodologies and Metrics develops frameworks for measuring correctness, rigor, and logical coherence, including fine-grained error analysis and verification protocols. Model Architectures and Training Methodologies explores how different neural designs, training regimes, and proof-search strategies influence generation quality, with works like Llemma[19] and DeepSeek Prover[27] exemplifying domain-adapted language models. Survey and Meta-Analysis Studies, represented by efforts such as Math Reasoning Survey[5] and Math Age LLMs[7], synthesize progress across these dimensions and identify emerging challenges.

Particularly active lines of work contrast automated theorem proving in formal systems—where tools like LeanDojo[32] and Baldur[2] leverage proof assistants for verifiable correctness—with natural-language proof generation, exemplified by Naturalprover[3] and Formalmath[1], which must balance human readability against formal rigor. Open questions center on scalability, the trade-off between symbolic verification and neural fluency, and the design of metrics that capture both syntactic validity and semantic insight.

The original paper, Open Proof Corpus[0], sits within the Competition and Olympiad Problem Datasets cluster, contributing a large-scale resource for evaluating proof generation on challenging contest-style problems.
Compared to neighbors like RefGrader[41], which emphasizes grading methodologies, Open Proof Corpus[0] prioritizes breadth and diversity of problem types, positioning itself as a foundational benchmark for stress-testing models across varied mathematical domains and difficulty levels.

Claimed Contributions

Open Proof Corpus (OPC) dataset

The authors introduce a large-scale dataset of more than 5,000 LLM-generated mathematical proofs from prestigious competitions, each with binary human correctness judgments and feedback. The OPC is designed for training and evaluation in proof generation research.

10 retrieved papers
Can Refute
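The dataset design described above (LLM-generated proofs paired with binary human correctness judgments and free-form feedback) can be sketched as a minimal record schema. The field names and sample values here are illustrative assumptions, not the OPC's actual format:

```python
from dataclasses import dataclass

@dataclass
class ProofRecord:
    problem_id: str  # e.g. competition, year, and problem number
    model: str       # LLM that produced the proof
    proof: str       # natural-language proof text
    correct: bool    # binary human correctness judgment
    feedback: str    # free-form judge feedback

def correctness_rate(records, model=None):
    """Fraction of proofs judged correct, optionally restricted to one model."""
    pool = [r for r in records if model is None or r.model == model]
    if not pool:
        return 0.0
    return sum(r.correct for r in pool) / len(pool)

# Tiny illustrative sample (hypothetical problem IDs and model names).
records = [
    ProofRecord("USAMO-2024-P1", "model-a", "...", True, "Complete."),
    ProofRecord("USAMO-2024-P1", "model-b", "...", False, "Gap in case 2."),
]
print(correctness_rate(records))             # 0.5
print(correctness_rate(records, "model-a"))  # 1.0
```

With per-proof binary labels of this shape, the per-model and per-competition correctness breakdowns reported in the paper reduce to simple filtered aggregations.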
Rigorous pipeline for generating and evaluating natural language proofs

The authors develop a systematic methodology involving problem and judge preparation, a grading procedure with clear guidelines, and monitoring and validation steps to ensure high-quality human evaluation of LLM-generated proofs.

10 retrieved papers
Fine-tuned 8B-parameter proof judging model

The authors fine-tune an 8B-parameter model (R1-Qwen3-8B) using GRPO on the OPC, resulting in a model that achieves 88.1% accuracy in judging proof correctness, matching Gemini-2.5-Pro and approaching GPT-5 performance.

10 retrieved papers
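Evaluating a judging model against human labels, and using such a judge for best-of-n selection (one of the open questions the OPC addresses), both reduce to simple comparisons. This is a generic sketch with toy data and a stand-in scoring function, not the authors' evaluation code:

```python
def judge_accuracy(judge_verdicts, human_verdicts):
    """Agreement rate between a judge's binary verdicts and human labels."""
    assert len(judge_verdicts) == len(human_verdicts)
    hits = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return hits / len(human_verdicts)

def best_of_n(proofs, score):
    """Best-of-n selection: return the candidate the judge scores highest."""
    return max(proofs, key=score)

# Illustrative labels: the judge agrees with humans on 7 of 8 proofs.
judge = [1, 0, 1, 1, 0, 0, 1, 1]
human = [1, 0, 1, 0, 0, 0, 1, 1]
print(judge_accuracy(judge, human))  # 0.875

# Toy scorer preferring longer candidates, standing in for a learned
# judge's confidence score over n sampled proofs.
candidates = ["short sketch", "a fully detailed proof with all cases"]
print(best_of_n(candidates, score=len))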

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Open Proof Corpus (OPC) dataset


Contribution

Rigorous pipeline for generating and evaluating natural language proofs


Contribution

Fine-tuned 8B-parameter proof judging model
