OpenThoughts: Data Recipes for Reasoning Models
Overview
Overall Novelty Assessment
The paper contributes OpenThoughts3-1.2M, a systematically refined dataset for training reasoning models, and OpenThinker3-7B, which achieves state-of-the-art results among open-data reasoning models at the 7B scale on AIME, LiveCodeBench, and GPQA Diamond. It resides in the 'Model Distillation and Trace Generation' leaf, which contains five papers total, including the original work. This leaf sits within the broader 'Dataset Construction Methodologies' branch, indicating a moderately populated research direction focused on extracting reasoning traces from stronger models rather than on synthetic generation or human annotation.
The taxonomy reveals neighboring leaves such as 'Synthetic Data Generation' (five papers) and 'Multi-Agent and Iterative Refinement' (three papers), both under the same parent branch. The paper's distillation-based approach contrasts with template-driven synthesis methods in the former and multi-agent verification strategies in the latter. Its sibling papers include OpenThoughts (earlier version), Distilled Reasoning Million, ReasonMed, and OpenRTLSet, which collectively explore distillation at scale, domain adaptation, and hardware-specific reasoning. The taxonomy structure suggests this is an active but not overcrowded subfield within dataset construction.
Among the 30 candidates examined, none clearly refutes the three main contributions. For the OpenThoughts3-1.2M dataset and pipeline, 10 candidates were reviewed with no refutable overlap. Likewise, 10 candidates each were reviewed for the OpenThinker3-7B model and for the empirical insights on data curation, and none directly overlaps with prior work. This limited search scope suggests that the specific combination of systematic pipeline investigation (1,000+ controlled experiments), QwQ-32B distillation, and the resulting performance gains may be novel within the examined literature, though the analysis does not cover the full field.
Based on top-30 semantic matches, the work appears to advance distillation-based dataset construction through systematic experimentation and achieves notable empirical results. However, the search scope is constrained, and the taxonomy shows this is an established research direction with multiple concurrent efforts. The novelty likely lies in the methodological rigor of pipeline optimization and the specific performance improvements demonstrated, rather than introducing entirely new conceptual approaches to reasoning dataset creation.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors develop a systematic data generation pipeline through over 1,000 controlled experiments, investigating each step including question sourcing, mixing, filtering, deduplication, answer sampling, answer filtering, and teacher model selection. This pipeline produces OpenThoughts3-1.2M, a dataset of 1.2 million examples for training reasoning models across math, code, and science domains.
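The pipeline stages listed above can be sketched as a small Python program. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the exact-match deduplication, and the toy teacher are hypothetical stand-ins for the paper's question sourcing, mixing, deduplication, and answer-sampling steps.

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    question: str
    answers: list = field(default_factory=list)

def mix_sources(sources):
    """Pool questions from the selected sources (the paper favors 1-2 high-quality ones)."""
    return [q for qs in sources for q in qs]

def deduplicate(questions):
    """Exact-match dedup for illustration; a real pipeline may use fuzzy or embedding dedup."""
    seen, out = set(), []
    for q in questions:
        key = q.strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(q)
    return out

def sample_answers(question, teacher, k=16):
    """Sample k reasoning traces per question from the teacher (16x in the paper)."""
    return [teacher(question, seed=i) for i in range(k)]

def build_dataset(sources, teacher, k=16):
    """Source -> mix -> dedup -> sample answers, mirroring the pipeline order."""
    questions = deduplicate(mix_sources(sources))
    return [Example(q, sample_answers(q, teacher, k)) for q in questions]

# Toy teacher standing in for a model like QwQ-32B.
toy_teacher = lambda q, seed: f"trace-{seed} for {q}"
data = build_dataset(
    [["What is 2+2?", "what is 2+2?"], ["Prove sqrt(2) is irrational."]],
    toy_teacher,
)
print(len(data), len(data[0].answers))  # 2 questions after dedup, 16 traces each
```

The answer-filtering and teacher-selection stages are omitted here; in the paper those are separate ablated steps rather than fixed functions.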
The authors train OpenThinker3-7B by fine-tuning Qwen2.5-7B-Instruct on their OpenThoughts3-1.2M dataset, achieving state-of-the-art performance among open-data reasoning models at the 7B scale, with improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B on AIME, LiveCodeBench, and GPQA Diamond, respectively.
Through systematic ablation studies, the authors report several key findings: sampling multiple answers per question (16×) is an effective way to scale the dataset; a nominally weaker teacher, QwQ-32B, can outperform a stronger one, DeepSeek-R1; answer filtering provides minimal benefit; selecting questions from one or two high-quality sources outperforms mixing many sources; and LLM-based question filtering (by difficulty or expected response length) outperforms classical methods such as fastText.
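One of these findings, LLM-based question filtering by expected response length, can be sketched as follows. This is a minimal illustration, not the paper's code: `estimate_length` is a hypothetical heuristic standing in for an LLM call that scores each question, and the keep fraction is an assumed parameter.

```python
def estimate_length(question: str) -> int:
    # Placeholder heuristic; in the paper an LLM predicts how long a
    # response to each question would be.
    return len(question.split()) * 10

def filter_by_predicted_length(questions, keep_fraction=0.5):
    """Keep the questions expected to require the longest responses."""
    ranked = sorted(questions, key=estimate_length, reverse=True)
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]

qs = [
    "2+2?",
    "Prove that the square root of two is irrational.",
    "Integrate x^2.",
    "State and prove the fundamental theorem of calculus.",
]
print(filter_by_predicted_length(qs, 0.5))  # keeps the two longest questions
```

A classical baseline such as fastText would instead train a supervised classifier over labeled easy/hard questions; the paper's finding is that LLM-derived signals like the one sketched here work better.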
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[6] AIMO-2 winning solution: Building state-of-the-art mathematical reasoning models with the OpenMathReasoning dataset PDF
[13] ReasonMed: A 370K multi-agent generated dataset for advancing medical reasoning PDF
[14] OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design PDF
[24] 1.4 million open-source distilled reasoning dataset to empower large language model training PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
OpenThoughts3-1.2M dataset and systematic data generation pipeline
The authors develop a systematic data generation pipeline through over 1,000 controlled experiments, investigating each step including question sourcing, mixing, filtering, deduplication, answer sampling, answer filtering, and teacher model selection. This pipeline produces OpenThoughts3-1.2M, a dataset of 1.2 million examples for training reasoning models across math, code, and science domains.
[5] Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps PDF
[11] SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond PDF
[28] Long is more important than difficult for training reasoning models PDF
[51] Towards large reasoning models: A survey of reinforced reasoning with large language models PDF
[61] Synthetic data generation & multi-step RL for reasoning & tool use PDF
[62] SpatialRGPT: Grounded spatial reasoning in vision-language models PDF
[63] Unicorn: Text-Only Data Synthesis for Vision Language Model Training PDF
[64] SLR: Automated synthesis for scalable logical reasoning PDF
[65] TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data PDF
[66] SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning PDF
OpenThinker3-7B state-of-the-art reasoning model
The authors train OpenThinker3-7B by fine-tuning Qwen2.5-7B-Instruct on their OpenThoughts3-1.2M dataset, achieving state-of-the-art performance among open-data reasoning models at the 7B scale, with improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B on AIME, LiveCodeBench, and GPQA Diamond, respectively.
[67] InternLM-Math: Open math large language models toward verifiable reasoning PDF
[68] How abilities in large language models are affected by supervised fine-tuning data composition PDF
[69] Advancing reasoning in large language models: Promising methods and approaches PDF
[70] Can large language models detect errors in long chain-of-thought reasoning? PDF
[71] LLM reasoning engine: Specialized training for enhanced mathematical reasoning PDF
[72] Solving quantitative reasoning problems with language models PDF
[73] Program synthesis with large language models PDF
[74] WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct PDF
[75] Improving large language model fine-tuning for solving math problems PDF
[76] SciTune: Aligning large language models with scientific multimodal instructions PDF
Empirical insights on reasoning data curation strategies
Through systematic ablation studies, the authors report several key findings: sampling multiple answers per question (16×) is an effective way to scale the dataset; a nominally weaker teacher, QwQ-32B, can outperform a stronger one, DeepSeek-R1; answer filtering provides minimal benefit; selecting questions from one or two high-quality sources outperforms mixing many sources; and LLM-based question filtering (by difficulty or expected response length) outperforms classical methods such as fastText.