OpenThoughts: Data Recipes for Reasoning Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Reasoning, Data, LLM
Abstract:

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on ANONYMIZED.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes OpenThoughts3-1.2M, a systematically refined dataset for training reasoning models, and OpenThinker3-7B, which achieves state-of-the-art results on AIME, LiveCodeBench, and GPQA Diamond. It resides in the 'Model Distillation and Trace Generation' leaf, which contains five papers total, including the original work. This leaf sits within the broader 'Dataset Construction Methodologies' branch, indicating a moderately populated research direction focused on extracting reasoning traces from stronger models rather than synthetic generation or human annotation.

The taxonomy reveals neighboring leaves such as 'Synthetic Data Generation' (five papers) and 'Multi-Agent and Iterative Refinement' (three papers), both under the same parent branch. The paper's distillation-based approach contrasts with template-driven synthesis methods in the former and multi-agent verification strategies in the latter. Its sibling papers include OpenThoughts (earlier version), Distilled Reasoning Million, ReasonMed, and OpenRTLSet, which collectively explore distillation at scale, domain adaptation, and hardware-specific reasoning. The taxonomy structure suggests this is an active but not overcrowded subfield within dataset construction.

Among 30 candidates examined, none clearly refute the three main contributions. For the OpenThoughts3-1.2M dataset and pipeline, 10 candidates were reviewed with zero refutable overlaps. Likewise, 10 candidates each were reviewed for the OpenThinker3-7B model and for the empirical insights on data curation, without finding prior work that directly overlaps. Given this limited search scope, the specific combination of systematic pipeline investigation (1,000+ experiments), QwQ-32B distillation, and the resulting performance gains may be novel within the examined literature, though the analysis does not cover the full field.

Based on top-30 semantic matches, the work appears to advance distillation-based dataset construction through systematic experimentation and achieves notable empirical results. However, the search scope is constrained, and the taxonomy shows this is an established research direction with multiple concurrent efforts. The novelty likely lies in the methodological rigor of pipeline optimization and the specific performance improvements demonstrated, rather than introducing entirely new conceptual approaches to reasoning dataset creation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: creating open-source datasets for training reasoning models. The field organizes around four main branches that reflect different aspects of dataset development. Dataset Construction Methodologies focuses on techniques for generating high-quality reasoning traces, including model distillation approaches that extract intermediate steps from capable models, synthetic data generation methods, and human annotation frameworks. Domain-Specific Reasoning Datasets targets specialized areas such as mathematics, science, medicine, and finance, where domain expertise shapes both problem formulation and solution strategies. General Reasoning Datasets and Benchmarks encompasses broader logical reasoning, multi-hop question answering, and cross-domain evaluation suites that test transferable reasoning skills. Training Frameworks and Model Architectures addresses how datasets integrate with learning paradigms, including reinforcement learning from reasoning traces and architectural innovations that leverage structured thought processes.

Representative works like OpenThoughts[0] and Open Reasoner Zero[1] illustrate distillation-based construction, while Llama Nemotron[3] and Openr Framework[2] demonstrate end-to-end training pipelines. A central tension across these branches involves balancing scale, diversity, and trace quality: distillation methods can rapidly produce large volumes of reasoning steps but may inherit biases from teacher models, whereas human-curated datasets offer higher fidelity at greater cost.

Within the Model Distillation and Trace Generation cluster, OpenThoughts[0] emphasizes extracting diverse reasoning patterns from frontier models to create broadly applicable training data, positioning itself alongside efforts like Distilled Reasoning Million[24] that prioritize volume and ReasonMed[13] that targets domain adaptation. Compared to OpenRTLSet[14], which focuses on hardware-specific reasoning, OpenThoughts[0] pursues more general-purpose trace generation. Meanwhile, works like AIMO Winner[6] demonstrate how competition-driven datasets can push mathematical reasoning boundaries, and JustLogic[7] explores formal logic domains. The ongoing challenge remains how to efficiently scale trace generation while maintaining the step-by-step coherence and correctness that enable models to learn robust reasoning strategies rather than superficial pattern matching.

Claimed Contributions

OpenThoughts3-1.2M dataset and systematic data generation pipeline

The authors develop a systematic data generation pipeline through over 1,000 controlled experiments, investigating each step including question sourcing, mixing, filtering, deduplication, answer sampling, answer filtering, and teacher model selection. This pipeline produces OpenThoughts3-1.2M, a dataset of 1.2 million examples for training reasoning models across math, code, and science domains.

10 retrieved papers
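The staged pipeline described above (question sourcing, filtering, deduplication, answer sampling, answer filtering) can be sketched as a toy Python function. The names below (`dedupe`, `build_dataset`, `keep_question`, `teacher`) are hypothetical placeholders for illustration, not the authors' actual implementation:

```python
def dedupe(items):
    """Order-preserving exact deduplication."""
    return list(dict.fromkeys(items))

def build_dataset(questions, keep_question, teacher, samples_per_question=16):
    """Toy sketch of the staged pipeline: dedupe -> filter questions ->
    sample multiple answers per question -> filter answers.

    keep_question and teacher are caller-supplied callables standing in
    for the paper's LLM-based question filters and teacher model.
    """
    kept = [q for q in dedupe(questions) if keep_question(q)]
    examples = []
    for q in kept:
        # Answer sampling: query the teacher several times per question
        # (the paper reports sampling up to 16 answers per question).
        for _ in range(samples_per_question):
            answer = teacher(q)
            if answer is not None:  # minimal answer filtering
                examples.append((q, answer))
    return examples
```

The sketch makes the pipeline's ordering explicit: deduplication and question filtering happen once per question, while answer sampling multiplies dataset size by the per-question sample count.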
OpenThinker3-7B state-of-the-art reasoning model

The authors train OpenThinker3-7B by fine-tuning Qwen2.5-7B-Instruct on their OpenThoughts3-1.2M dataset, achieving state-of-the-art performance among open-data reasoning models at the 7B scale, with improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B on key benchmarks.

10 retrieved papers
Empirical insights on reasoning data curation strategies

Through systematic ablation studies, the authors discover several key findings: sampling multiple answers per question (16×) effectively increases dataset scale; weaker teacher models like QwQ-32B can outperform stronger ones like DeepSeek-R1; answer filtering provides minimal benefit; selecting questions from 1-2 high-quality sources outperforms mixing many sources; and LLM-based question filtering (difficulty, response length) outperforms classical methods like fastText.

10 retrieved papers
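One of the reported findings, filtering questions by the length of a model's response as a difficulty proxy, can be illustrated with a small sketch. Here `respond` is a caller-supplied callable standing in for an LLM, and `keep_fraction` is an assumed tunable, not a value from the paper:

```python
def length_filter(questions, respond, keep_fraction=0.5):
    """Keep the questions that elicit the longest responses, treating
    response length as a rough proxy for question difficulty."""
    # Rank questions by how long the model's answer to each one is.
    ranked = sorted(questions, key=lambda q: len(respond(q)), reverse=True)
    # Keep the top fraction (at least one question).
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]
```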

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

OpenThoughts3-1.2M dataset and systematic data generation pipeline


Contribution

OpenThinker3-7B state-of-the-art reasoning model


Contribution

Empirical insights on reasoning data curation strategies
