The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Chain-of-Thought, Knowledge Distillation, Large Language Models, Benchmarking, Data Augmentation, Data Selection, Data Mixing
Abstract:

Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark for systematically assessing the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from the method, model, and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B and 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The anonymous codebase can be accessed at https://anonymous.4open.science/r/DC-COT-FF4C/

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DC-CoT, a comprehensive benchmark for evaluating data-centric chain-of-thought distillation strategies across multiple teacher models and student architectures. It resides in the Difficulty-Aware and Adaptive Data Selection leaf, which contains only three papers including this work. This leaf sits within the broader Distillation Data Generation and Curation branch, one of nine major research directions in the taxonomy. The relatively sparse population of this specific leaf suggests that systematic benchmarking of adaptive data selection remains an underexplored area, though the parent branch itself is well-developed with four distinct sub-areas addressing different aspects of data curation.

The taxonomy reveals substantial activity in neighboring areas. The sibling leaves—Teacher Model Selection and Data Synthesis, Data Quality and Filtering Mechanisms, and Data Augmentation and Diversification—collectively contain eight papers addressing complementary aspects of data preparation. Adjacent branches like Distillation Training Strategies and Model Architecture Adaptations are more densely populated, indicating that the field has historically emphasized optimization techniques and architectural modifications over systematic data manipulation studies. The scope notes clarify that this work's focus on data-centric evaluation distinguishes it from training procedure innovations or architectural changes explored elsewhere in the taxonomy.

Among the thirty candidates examined, the benchmark contribution (Contribution 1) shows no clear refutation across its ten reviewed papers, while the empirical evaluation component (Contribution 2) encountered three potentially overlapping works among its ten candidates. The actionable guidelines contribution (Contribution 3) similarly found no refuting prior work in its ten-candidate examination. These statistics reflect top-K semantic matches rather than exhaustive coverage. The presence of refutable candidates for the empirical evaluation suggests that comparative studies of distillation strategies already exist, though the benchmark framework itself appears less directly anticipated by prior work within this search window.

Based on the thirty-candidate search, the work appears to occupy a relatively novel position as a systematic benchmarking effort in an area where individual methods have been proposed but comprehensive comparative frameworks remain scarce. The taxonomy structure confirms that while data-centric techniques are recognized as important, the specific combination of benchmark design and multi-dimensional evaluation across teacher-student configurations has limited direct precedent among the examined candidates. The analysis necessarily reflects the bounded search scope and cannot rule out relevant work outside the top-K semantic neighborhood.
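The "top-K semantic neighborhood" caveat above typically stems from embedding-based retrieval: each claimed contribution is matched against a paper corpus and only the K nearest candidates are compared. A minimal sketch of that mechanism follows; it is illustrative only (all names are hypothetical, the toy bag-of-words embedding stands in for the dense encoders a real pipeline such as WisPaper would use).

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real pipelines use dense neural encoders.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity over sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_candidates(claim: str, corpus: dict[str, str], k: int = 10) -> list[str]:
    # Rank candidate papers by similarity to the claimed contribution and
    # keep only the K nearest -- hence the "bounded search scope" caveat.
    q = embed(claim)
    ranked = sorted(corpus, key=lambda pid: cosine(q, embed(corpus[pid])), reverse=True)
    return ranked[:k]

corpus = {
    "paper_a": "benchmark for chain of thought distillation data selection",
    "paper_b": "image segmentation with convolutional networks",
    "paper_c": "data centric distillation of reasoning traces",
}
print(top_k_candidates("data-centric benchmark for CoT distillation", corpus, k=2))
# -> ['paper_a', 'paper_c']
```

Because relevance is judged only within the returned top-K set, a genuinely related paper whose wording differs enough to fall outside that neighborhood is simply never compared, which is exactly the limitation the assessment acknowledges.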

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: data-centric chain-of-thought distillation for language models. The field has evolved into a rich taxonomy with nine major branches, each addressing a distinct facet of transferring reasoning capabilities from large teacher models to smaller students. Distillation Data Generation and Curation focuses on how to select, filter, and synthesize high-quality reasoning traces, often emphasizing difficulty-aware sampling or adaptive data selection strategies. Distillation Training Strategies explores optimization techniques and loss formulations that best preserve reasoning fidelity. Model Architecture Adaptations investigates structural modifications that enable efficient reasoning in compact models, while Cross-Model and Cross-Domain Distillation examines transfer across different model families or task settings. Specialized Application Domains tailors distillation to areas such as vision-language tasks, embodied agents, or scientific reasoning. Reasoning Enhancement and Augmentation develops methods to refine or augment chain-of-thought traces themselves, and Privacy-Preserving and Federated Distillation addresses decentralized or secure distillation scenarios. Finally, Interpretability and Explainability, together with Empirical Analysis and Mechanistic Studies, provide deeper insight into why and how distillation works.

Within the Distillation Data Generation and Curation branch, a particularly active line of work centers on difficulty-aware and adaptive data selection, where researchers seek to identify which reasoning examples most benefit student models. The Quest for Efficient[0] sits squarely in this cluster, emphasizing adaptive curation strategies that prioritize informative or challenging instances. Nearby works such as Adaptive Chain-of-Thought Distillation Based[36] and DA-CoTD[49] similarly explore dynamic selection mechanisms, though they may differ in how they measure difficulty or adapt sampling over training.
In contrast, other branches like Reasoning Enhancement and Augmentation focus less on data selection and more on refining the reasoning traces themselves through techniques like symbolic distillation (Symbolic Chain-of-Thought Distillation[3]) or progressive keypoint extraction. The central tension across these directions is balancing data efficiency—using fewer but higher-quality examples—with coverage of diverse reasoning patterns, a trade-off that remains an open question as models scale and tasks diversify.

Claimed Contributions

DC-CoT: A data-centric benchmark for CoT distillation

The authors introduce DC-CoT, the first comprehensive benchmark designed to systematically investigate data-centric manipulations (augmentation, selection, and mixing) in chain-of-thought distillation. The benchmark evaluates these manipulations across multiple teacher models, student architectures, and reasoning datasets to assess their impact on model performance.

10 retrieved papers
Extensive empirical evaluation of CoT distillation strategies

The authors perform comprehensive experiments using various teacher models (GPT-4o, Claude-3.5, Gemini-Pro) and student architectures (3B-8B parameters) across textual, agentic, and visual reasoning tasks. This provides the first systematic large-scale empirical analysis of how different factors influence CoT distillation effectiveness.

10 retrieved papers
Can Refute
Actionable guidelines for effective CoT distillation

The authors derive practical guidelines from their experiments, specifying which data-centric techniques work best for different task types. For example, they identify that Reverse Thinking excels for structured logic tasks, Answer Augmentation suits open-ended linguistic tasks, and LLM-as-a-Judge filtering is necessary for agentic and visual tasks.

10 retrieved papers
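The guidelines in Contribution 3 amount to a task-type → technique lookup. The sketch below restates them as a simple mapping; the keys are paraphrased task labels and the code is purely illustrative, not the authors' implementation.

```python
# Illustrative restatement of the report's distillation guidelines.
# Task-type labels are paraphrased from the prose; this is not the authors' code.
GUIDELINES = {
    "structured_logic": "Reverse Thinking",
    "open_ended_linguistic": "Answer Augmentation",
    "agentic": "LLM-as-a-Judge filtering",
    "visual": "LLM-as-a-Judge filtering",
}

def recommend(task_type: str) -> str:
    # Fall back conservatively for task types the guidelines do not cover.
    return GUIDELINES.get(task_type, "no specific recommendation")

print(recommend("agentic"))
# -> LLM-as-a-Judge filtering
```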

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: DC-CoT: A data-centric benchmark for CoT distillation
Contribution 2: Extensive empirical evaluation of CoT distillation strategies
Contribution 3: Actionable guidelines for effective CoT distillation