The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: AI, artificial intelligence, reasoning, LLM, math, benchmark, dataset, proof, GPT, machine learning
Abstract:

In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and addressing key open questions in the field of automated proof generation. Specifically, it remains unknown (1) how large the gap is between natural-language and formal proof generation, (2) how final-answer accuracy relates to full proof correctness, and (3) how best-of-n selection strategies affect proof quality. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first large dataset of LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we address the open questions outlined above and provide new insights into LLMs' strengths and limitations in mathematical reasoning. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that matches Gemini-2.5-Pro and performs close to the best model, GPT-5, at evaluating proof correctness.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the Open Proof Corpus (OPC), a dataset of over 5,000 human-evaluated LLM-generated proofs targeting prestigious competition problems (USAMO, IMO). It resides in the Competition and Olympiad Problem Datasets leaf, which contains four papers total. This is a moderately populated research direction within the broader Benchmark Development branch, suggesting active but not overcrowded interest in competition-level evaluation resources. The work explicitly addresses gaps in understanding natural versus formal proof generation and the relationship between final-answer accuracy and full proof correctness.

The taxonomy reveals neighboring leaves focused on Formal Proof Benchmarks (five papers requiring proof-assistant verification) and Natural Language Proof Benchmarks (four papers emphasizing informal mathematical language). The OPC bridges these domains by collecting natural-language proofs for competition problems, a niche distinct from university-level coursework benchmarks (four papers) and specialized domain datasets (four papers in trigonometry, inequalities, etc.). Its scope note emphasizes high-difficulty contest settings, differentiating it from general undergraduate or research-level problem collections that lack the competitive structure.

Among the 30 candidates examined (ten per claimed contribution), the OPC dataset contribution shows overlap with two prior works. The pipeline for generating and evaluating natural-language proofs and the fine-tuned 8B-parameter judging model each yielded zero refutable candidates among their ten, so both appear more distinctive within this limited search scope. The dataset contribution's partial overlap likely reflects existing competition-problem collections, though the scale of human evaluation and the focus on LLM-generated solutions may differentiate it.

Based on top-30 semantic matches, the work occupies a recognizable but not densely populated niche. The dataset's novelty hinges on its scale of human evaluation and integration of multiple competition sources, while the methodological contributions (pipeline, judging model) show fewer overlaps in the examined literature. This analysis covers a targeted sample rather than exhaustive field coverage, so additional relevant work may exist beyond the search scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Large-scale evaluation of LLM-generated mathematical proofs. The field has organized itself around four main branches that collectively address how to rigorously assess machine-generated formal and informal mathematical reasoning. Benchmark Development and Dataset Construction focuses on creating diverse problem collections—ranging from competition and Olympiad datasets like Putnam AXIOM[23] and RIMO[50] to specialized domains such as inequality proofs and combinatorial identities—that provide standardized testbeds for model capabilities. Evaluation Methodologies and Metrics develops frameworks for measuring correctness, rigor, and logical coherence, including fine-grained error analysis and verification protocols. Model Architectures and Training Methodologies explores how different neural designs, training regimes, and proof-search strategies influence generation quality, with works like Llemma[19] and DeepSeek Prover[27] exemplifying domain-adapted language models. Survey and Meta-Analysis Studies, represented by efforts such as Math Reasoning Survey[5] and Math Age LLMs[7], synthesize progress across these dimensions and identify emerging challenges.

Particularly active lines of work contrast automated theorem proving in formal systems—where tools like LeanDojo[32] and Baldur[2] leverage proof assistants for verifiable correctness—with natural-language proof generation, exemplified by Naturalprover[3] and Formalmath[1], which must balance human readability against formal rigor. Open questions center on scalability, the trade-off between symbolic verification and neural fluency, and the design of metrics that capture both syntactic validity and semantic insight.

The original paper, Open Proof Corpus[0], sits within the Competition and Olympiad Problem Datasets cluster, contributing a large-scale resource for evaluating proof generation on challenging contest-style problems.
Compared to neighbors like RefGrader[41], which emphasizes grading methodologies, Open Proof Corpus[0] prioritizes breadth and diversity of problem types, positioning itself as a foundational benchmark for stress-testing models across varied mathematical domains and difficulty levels.

Claimed Contributions

Open Proof Corpus (OPC) dataset

The authors introduce a large-scale dataset of more than 5,000 LLM-generated mathematical proofs from prestigious competitions, each with binary human correctness judgments and feedback. The OPC is designed for training and evaluation in proof generation research.

10 retrieved papers
Can Refute
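The dataset design described above (LLM-generated proofs paired with binary human correctness judgments and free-form feedback) can be sketched as a minimal record schema. The field names and sample values here are illustrative assumptions, not the OPC's actual format:

```python
from dataclasses import dataclass

@dataclass
class ProofRecord:
    problem_id: str  # e.g. competition, year, and problem number
    model: str       # LLM that produced the proof
    proof: str       # natural-language proof text
    correct: bool    # binary human correctness judgment
    feedback: str    # free-form judge feedback

def correctness_rate(records, model=None):
    """Fraction of proofs judged correct, optionally restricted to one model."""
    pool = [r for r in records if model is None or r.model == model]
    if not pool:
        return 0.0
    return sum(r.correct for r in pool) / len(pool)

# Tiny illustrative sample (hypothetical problem IDs and model names).
records = [
    ProofRecord("USAMO-2024-P1", "model-a", "...", True, "Complete."),
    ProofRecord("USAMO-2024-P1", "model-b", "...", False, "Gap in case 2."),
]
print(correctness_rate(records))             # 0.5
print(correctness_rate(records, "model-a"))  # 1.0
```

With per-proof binary labels of this shape, the per-model and per-competition correctness breakdowns reported in the paper reduce to simple filtered aggregations.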
Rigorous pipeline for generating and evaluating natural language proofs

The authors develop a systematic methodology involving problem and judge preparation, a grading procedure with clear guidelines, and monitoring and validation steps to ensure high-quality human evaluation of LLM-generated proofs.

10 retrieved papers
Fine-tuned 8B-parameter proof judging model

The authors fine-tune an 8B-parameter model (R1-Qwen3-8B) using GRPO on the OPC, resulting in a model that achieves 88.1% accuracy in judging proof correctness, matching Gemini-2.5-Pro and approaching GPT-5 performance.

10 retrieved papers
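Evaluating a judging model against human labels, and using such a judge for best-of-n selection (one of the open questions the OPC addresses), both reduce to simple comparisons. This is a generic sketch with toy data and a stand-in scoring function, not the authors' evaluation code:

```python
def judge_accuracy(judge_verdicts, human_verdicts):
    """Agreement rate between a judge's binary verdicts and human labels."""
    assert len(judge_verdicts) == len(human_verdicts)
    hits = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return hits / len(human_verdicts)

def best_of_n(proofs, score):
    """Best-of-n selection: return the candidate the judge scores highest."""
    return max(proofs, key=score)

# Illustrative labels: the judge agrees with humans on 7 of 8 proofs.
judge = [1, 0, 1, 1, 0, 0, 1, 1]
human = [1, 0, 1, 0, 0, 0, 1, 1]
print(judge_accuracy(judge, human))  # 0.875

# Toy scorer preferring longer candidates, standing in for a learned
# judge's confidence score over n sampled proofs.
candidates = ["short sketch", "a fully detailed proof with all cases"]
print(best_of_n(candidates, score=len))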

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Open Proof Corpus (OPC) dataset


Contribution

Rigorous pipeline for generating and evaluating natural language proofs


Contribution

Fine-tuned 8B-parameter proof judging model
