The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs
Overview
Overall Novelty Assessment
The paper introduces the Open Proof Corpus (OPC), a dataset of over 5,000 human-evaluated, LLM-generated proofs of problems from prestigious competitions such as the USAMO and IMO. It resides in the Competition and Olympiad Problem Datasets leaf, which contains four papers in total. This is a moderately populated research direction within the broader Benchmark Development branch, suggesting active but not overcrowded interest in competition-level evaluation resources. The work explicitly addresses gaps in understanding natural-language versus formal proof generation and the relationship between final-answer accuracy and full proof correctness.
The taxonomy reveals neighboring leaves focused on Formal Proof Benchmarks (five papers requiring proof-assistant verification) and Natural Language Proof Benchmarks (four papers emphasizing informal mathematical language). The OPC bridges these domains by collecting natural-language proofs for competition problems, a niche distinct from university-level coursework benchmarks (four papers) and specialized domain datasets (four papers in trigonometry, inequalities, etc.). Its scope_note emphasizes high-difficulty contest settings, differentiating it from general undergraduate or research-level problem collections that lack the competitive structure.
Of the 30 candidates examined in total, ten were reviewed against each claimed contribution. The OPC dataset contribution overlaps with two of its ten reviewed candidates, whereas the pipeline for generating and evaluating natural-language proofs and the fine-tuned 8B-parameter judging model each yielded zero refutable candidates among their ten, making them appear more distinctive within this limited search scope. The dataset contribution's partial overlap likely reflects existing competition-problem collections, though its scale of human evaluation and focus on LLM-generated solutions may differentiate it.
Based on top-30 semantic matches, the work occupies a recognizable but not densely populated niche. The dataset's novelty hinges on its scale of human evaluation and integration of multiple competition sources, while the methodological contributions (pipeline, judging model) show fewer overlaps in the examined literature. This analysis covers a targeted sample rather than exhaustive field coverage, so additional relevant work may exist beyond the search scope.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a large-scale dataset of more than 5,000 LLM-generated proofs of problems from prestigious mathematical competitions, each paired with a binary human correctness judgment and written feedback. The OPC is designed to support both training and evaluation in proof-generation research.
The authors develop a systematic methodology involving problem and judge preparation, a grading procedure with clear guidelines, and monitoring and validation steps to ensure high-quality human evaluation of LLM-generated proofs.
The authors fine-tune an 8B-parameter model (R1-QWEN3-8B) using GRPO on the OPC, resulting in a model that achieves 88.1% accuracy in judging proof correctness, matching GEMINI-2.5-PRO and approaching GPT-5 performance.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[23] Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs
[41] RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows
[50] RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning
Contribution Analysis
Detailed comparisons for each claimed contribution
Open Proof Corpus (OPC) dataset
The authors introduce a large-scale dataset of more than 5,000 LLM-generated proofs of problems from prestigious mathematical competitions, each paired with a binary human correctness judgment and written feedback. The OPC is designed to support both training and evaluation in proof-generation research.
[12] Solving Inequality Proofs with Large Language Models
[39] DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning
[3] NaturalProver: Grounded Mathematical Proof Generation with Language Models
[9] Solving Mathematical Problems Using Large Language Models: A Survey
[15] Reliable Fine-Grained Evaluation of Natural Language Math Proofs
[30] Benchmarking LLMs on Advanced Mathematical Reasoning
[41] RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows
[66] CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
[67] Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
[68] Understanding the Thinking Process of Reasoning Models: A Perspective from Schoenfeld's Episode Theory
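The dataset contribution above centers on proofs paired with binary human judgments and written feedback. A minimal sketch of how such a record might be represented in code (the field names here are assumptions for illustration, not the OPC's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ProofRecord:
    # Hypothetical fields; the actual OPC schema may differ.
    problem_id: str    # e.g. a USAMO/IMO problem identifier
    model: str         # LLM that generated the proof
    proof_text: str    # the natural-language proof
    is_correct: bool   # binary human correctness judgment
    feedback: str      # free-form judge feedback

def correctness_rate(records: list[ProofRecord]) -> float:
    """Fraction of proofs judged correct; 0.0 for an empty list."""
    if not records:
        return 0.0
    return sum(r.is_correct for r in records) / len(records)

records = [
    ProofRecord("usamo-2024-p1", "model-a", "Assume n >= 1 ...", True, "Complete."),
    ProofRecord("usamo-2024-p1", "model-b", "By symmetry ...", False, "Gap in case 2."),
]
print(correctness_rate(records))  # → 0.5
```

Aggregating the binary judgments this way is what yields the per-model proof-correctness rates such a corpus supports.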
Rigorous pipeline for generating and evaluating natural language proofs
The authors develop a systematic methodology involving problem and judge preparation, a grading procedure with clear guidelines, and monitoring and validation steps to ensure high-quality human evaluation of LLM-generated proofs.
[51] Implementing a Proposed Framework for Enhancing Critical Thinking Skills in Synthesizing AI-Generated Texts
[52] Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
[53] DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models
[54] FOLIO: Natural Language Reasoning with First-Order Logic
[55] Mathesis: Towards Formal Theorem Proving from Natural Languages
[56] LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
[57] CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text
[58] Logical Reasoning in Formal and Everyday Reasoning Tasks
[59] Inspiration in Human Reasoning Logic: Automating the Inference and Analysis of Traffic Accident Information via Macro-Micro Integration
[60] ARN: Analogical Reasoning on Narratives
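The monitoring and validation steps in the pipeline contribution typically involve checking agreement between judges on double-graded proofs. One standard statistic for that check is Cohen's kappa over binary verdicts; a minimal sketch (the paper does not necessarily use this exact statistic):

```python
def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two judges' binary verdicts on the same proofs."""
    assert len(a) == len(b) and a, "need equal-length, non-empty verdict lists"
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    p_a_true = sum(a) / n
    p_b_true = sum(b) / n
    # Chance agreement: both say correct, or both say incorrect.
    p_e = p_a_true * p_b_true + (1 - p_a_true) * (1 - p_b_true)
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)

judge1 = [True, True, False, True, False, False]
judge2 = [True, False, False, True, False, True]
print(round(cohens_kappa(judge1, judge2), 3))  # → 0.333
```

Low kappa on overlapping assignments would flag judges whose grading needs review, which is the point of the monitoring step.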
Fine-tuned 8B-parameter proof judging model
The authors fine-tune an 8B-parameter model (R1-QWEN3-8B) using GRPO on the OPC, resulting in a model that achieves 88.1% accuracy in judging proof correctness, matching GEMINI-2.5-PRO and approaching GPT-5 performance.
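GRPO trains by sampling a group of outputs per prompt and computing group-relative advantages from a scalar reward; with the OPC's binary human labels, a natural reward for a judging model is agreement with the human verdict. A minimal sketch of that advantage computation (the reward shaping and constants are assumptions, not the paper's exact recipe):

```python
import statistics

def binary_reward(predicted: bool, human_label: bool) -> float:
    """1.0 if the model's verdict matches the human judgment, else 0.0."""
    return 1.0 if predicted == human_label else 0.0

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (r - mean) / (std + eps) over one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One proof, four sampled verdicts from the judge model; human label: correct.
human_label = True
samples = [True, False, True, True]
rewards = [binary_reward(s, human_label) for s in samples]
print(grpo_advantages(rewards))
```

Verdicts that agree with the human label receive positive advantages and the disagreeing sample a negative one, so the policy update pushes the judge toward the human-annotated correctness labels in the OPC.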