OpenThoughts: Data Recipes for Reasoning Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Reasoning, Data, LLM
Abstract:

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on ANONYMIZED.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes OpenThoughts3-1.2M, a systematically refined dataset for training reasoning models, and OpenThinker3-7B, which achieves state-of-the-art results on AIME, LiveCodeBench, and GPQA Diamond. It resides in the 'Model Distillation and Trace Generation' leaf, which contains five papers total, including the original work. This leaf sits within the broader 'Dataset Construction Methodologies' branch, indicating a moderately populated research direction focused on extracting reasoning traces from stronger models rather than synthetic generation or human annotation.

The taxonomy reveals neighboring leaves such as 'Synthetic Data Generation' (five papers) and 'Multi-Agent and Iterative Refinement' (three papers), both under the same parent branch. The paper's distillation-based approach contrasts with template-driven synthesis methods in the former and multi-agent verification strategies in the latter. Its sibling papers include OpenThoughts (earlier version), Distilled Reasoning Million, ReasonMed, and OpenRTLSet, which collectively explore distillation at scale, domain adaptation, and hardware-specific reasoning. The taxonomy structure suggests this is an active but not overcrowded subfield within dataset construction.

Among 30 candidates examined, none clearly refute the three main contributions. For the OpenThoughts3-1.2M dataset and pipeline, 10 candidates were reviewed with zero refutable overlaps. Likewise, 10 candidates each were reviewed for the OpenThinker3-7B model and for the empirical insights on data curation, without finding prior work that directly overlaps. Given this limited search scope, the specific combination of systematic pipeline investigation (1,000+ experiments), QwQ-32B distillation, and the resulting performance gains may be novel within the examined literature, though the analysis does not cover the full field.

Based on top-30 semantic matches, the work appears to advance distillation-based dataset construction through systematic experimentation and achieves notable empirical results. However, the search scope is constrained, and the taxonomy shows this is an established research direction with multiple concurrent efforts. The novelty likely lies in the methodological rigor of pipeline optimization and the specific performance improvements demonstrated, rather than introducing entirely new conceptual approaches to reasoning dataset creation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: creating open-source datasets for training reasoning models. The field organizes around four main branches that reflect different aspects of dataset development. Dataset Construction Methodologies focuses on techniques for generating high-quality reasoning traces, including model distillation approaches that extract intermediate steps from capable models, synthetic data generation methods, and human annotation frameworks. Domain-Specific Reasoning Datasets targets specialized areas such as mathematics, science, medicine, and finance, where domain expertise shapes both problem formulation and solution strategies. General Reasoning Datasets and Benchmarks encompasses broader logical reasoning, multi-hop question answering, and cross-domain evaluation suites that test transferable reasoning skills. Training Frameworks and Model Architectures addresses how datasets integrate with learning paradigms, including reinforcement learning from reasoning traces and architectural innovations that leverage structured thought processes.

Representative works like OpenThoughts[0] and Open Reasoner Zero[1] illustrate distillation-based construction, while Llama Nemotron[3] and Openr Framework[2] demonstrate end-to-end training pipelines. A central tension across these branches involves balancing scale, diversity, and trace quality: distillation methods can rapidly produce large volumes of reasoning steps but may inherit biases from teacher models, whereas human-curated datasets offer higher fidelity at greater cost.

Within the Model Distillation and Trace Generation cluster, OpenThoughts[0] emphasizes extracting diverse reasoning patterns from frontier models to create broadly applicable training data, positioning itself alongside efforts like Distilled Reasoning Million[24] that prioritize volume and ReasonMed[13] that targets domain adaptation. Compared to OpenRTLSet[14], which focuses on hardware-specific reasoning, OpenThoughts[0] pursues more general-purpose trace generation. Meanwhile, works like AIMO Winner[6] demonstrate how competition-driven datasets can push mathematical reasoning boundaries, and JustLogic[7] explores formal logic domains. The ongoing challenge remains how to efficiently scale trace generation while maintaining the step-by-step coherence and correctness that enable models to learn robust reasoning strategies rather than superficial pattern matching.

Claimed Contributions

OpenThoughts3-1.2M dataset and systematic data generation pipeline

The authors develop a systematic data generation pipeline through over 1,000 controlled experiments, investigating each step including question sourcing, mixing, filtering, deduplication, answer sampling, answer filtering, and teacher model selection. This pipeline produces OpenThoughts3-1.2M, a dataset of 1.2 million examples for training reasoning models across math, code, and science domains.

10 retrieved papers
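The staged pipeline described above (question sourcing, filtering, deduplication, answer sampling, answer filtering) can be sketched as a toy Python function. The names below (`dedupe`, `build_dataset`, `keep_question`, `teacher`) are hypothetical placeholders for illustration, not the authors' actual implementation:

```python
def dedupe(items):
    """Order-preserving exact deduplication."""
    return list(dict.fromkeys(items))

def build_dataset(questions, keep_question, teacher, samples_per_question=16):
    """Toy sketch of the staged pipeline: dedupe -> filter questions ->
    sample multiple answers per question -> filter answers.

    keep_question and teacher are caller-supplied callables standing in
    for the paper's LLM-based question filters and teacher model.
    """
    kept = [q for q in dedupe(questions) if keep_question(q)]
    examples = []
    for q in kept:
        # Answer sampling: query the teacher several times per question
        # (the paper reports sampling up to 16 answers per question).
        for _ in range(samples_per_question):
            answer = teacher(q)
            if answer is not None:  # minimal answer filtering
                examples.append((q, answer))
    return examples
```

The sketch makes the pipeline's ordering explicit: deduplication and question filtering happen once per question, while answer sampling multiplies dataset size by the per-question sample count.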
OpenThinker3-7B state-of-the-art reasoning model

The authors train OpenThinker3-7B by fine-tuning Qwen2.5-7B-Instruct on their OpenThoughts3-1.2M dataset, achieving state-of-the-art performance among open-data reasoning models at the 7B scale, with improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B on key benchmarks.

10 retrieved papers
Empirical insights on reasoning data curation strategies

Through systematic ablation studies, the authors discover several key findings: sampling multiple answers per question (16×) effectively increases dataset scale; weaker teacher models like QwQ-32B can outperform stronger ones like DeepSeek-R1; answer filtering provides minimal benefit; selecting questions from 1-2 high-quality sources outperforms mixing many sources; and LLM-based question filtering (difficulty, response length) outperforms classical methods like fastText.

10 retrieved papers
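One of the reported findings, filtering questions by the length of a model's response as a difficulty proxy, can be illustrated with a small sketch. Here `respond` is a caller-supplied callable standing in for an LLM, and `keep_fraction` is an assumed tunable, not a value from the paper:

```python
def length_filter(questions, respond, keep_fraction=0.5):
    """Keep the questions that elicit the longest responses, treating
    response length as a rough proxy for question difficulty."""
    # Rank questions by how long the model's answer to each one is.
    ranked = sorted(questions, key=lambda q: len(respond(q)), reverse=True)
    # Keep the top fraction (at least one question).
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]
```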

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

OpenThoughts3-1.2M dataset and systematic data generation pipeline


Contribution

OpenThinker3-7B state-of-the-art reasoning model


Contribution

Empirical insights on reasoning data curation strategies
