Abstract:

The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities emerge only in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient: pre-training on 4.2T tokens resampled from these ~2T tokens, followed by an established post-training procedure, yields X-LLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, X-LLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being pre-trained on only 11.7% of the tokens in Qwen3's proprietary 36T-token corpus, X-LLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we release the complete training recipe, data sources, data mixing ratios, and model checkpoints, together with the key insights obtained throughout this study.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes X-LLM-R1, a series of sub-billion-parameter reasoning models trained on approximately 2T tokens of curated data, challenging the assumption that reasoning emergence requires datasets exceeding 10T tokens. It resides in the Data-Centric Training Strategies leaf under Independent Training Approaches for Small Models, alongside two sibling papers: Sub-Billion Reasoners and Solution Guidance. This leaf represents a focused research direction within the broader taxonomy of 50 papers across 36 topics, emphasizing data curation and resampling over distillation from large teacher models. The concentration of only three papers in this specific leaf suggests a relatively sparse but emerging area of investigation.

The taxonomy reveals that Data-Centric Training Strategies sits within a larger branch of Independent Training Approaches, which also includes Reinforcement Learning for Reasoning and Architectural Innovations for Reasoning. Neighboring branches include Knowledge Distillation from Large to Small Models, which contains five papers on Chain-of-Thought Distillation and four on Specialized Knowledge Distillation, representing a more crowded research direction. The scope note for Data-Centric Training explicitly excludes reinforcement learning and architectural modifications, positioning this work as focused on training data quality rather than algorithmic or structural innovations. This placement suggests the paper diverges from the dominant distillation paradigm toward autonomous data optimization.

Among 28 candidates examined, the analysis identified potential prior-work overlap for two of the three contributions. The data–model co-evolution strategy for mid-training was compared against 9 candidates, with 1 refutable match, and the X-LLM-R1 model series was compared against 9 candidates, also with 1 refutable match. The benchmark-free, self-evolving data optimization contribution was compared against 10 candidates with no clear refutations. Given the limited search scope of roughly 30 papers, these statistics suggest that the data optimization methodology is the most distinctive of the three, while the co-evolution strategy and model release have closer precedents in the examined literature. This search does not constitute exhaustive coverage of the field.

Based on top-30 semantic matches and citation expansion, the work appears to occupy a relatively underexplored niche within data-centric training for small reasoning models. The taxonomy structure indicates that while distillation-based approaches dominate the broader field, independent training strategies remain less densely populated. The limited search scope means that additional relevant work may exist beyond the examined candidates, particularly in adjacent areas like synthetic data generation or curriculum learning for reasoning tasks.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Paper: 1

Research Landscape Overview

Core task: Efficient reasoning capability development in sub-billion-parameter language models. The field has organized itself around several complementary strategies for building capable small-scale reasoners. Knowledge Distillation from Large to Small Models (e.g., Distilling Reasoning[1], Mathematical Reasoning Distillation[4]) transfers reasoning patterns from powerful teachers to compact students, while Independent Training Approaches for Small Models emphasize data-centric methods and self-improvement without relying on large-model supervision. Collaborative and Hybrid Reasoning Systems explore how small models can work alongside larger counterparts or external tools, and Efficiency Optimization branches focus on inference acceleration and resource-constrained deployment. Domain-Specific Small Model Applications target specialized contexts such as healthcare (Healthcare SLMs[17]) or IoT environments (LLMs for IoT[26]), and Evaluation and Analysis branches provide benchmarks and interpretability studies. Advanced Training Frameworks introduce novel optimization paradigms, including reinforcement learning and structured policy methods.

Within the Independent Training Approaches branch, a particularly active line of work centers on Data-Centric Training Strategies that curate high-quality reasoning traces or synthetic data to bootstrap small-model performance. Sub-Billion Reasoners[0] exemplifies this direction by systematically generating and filtering reasoning examples tailored to compact architectures, closely aligning with Solution Guidance[2] and Synthetic Thinking[35], which also emphasize crafted training signals over distillation from large teachers. In contrast, distillation-focused methods such as Multi-Step Reasoning[3] and Multi-Step Distillation[12] rely heavily on teacher-generated chains of thought, trading independence for the richness of large-model supervision. The central tension across these branches is balancing data efficiency, computational cost, and the degree of reliance on external large models, with Sub-Billion Reasoners[0] occupying a niche that prioritizes autonomous data generation and scalability within strict parameter budgets.

Claimed Contributions

Benchmark-free, self-evolving data optimization for pre-training data curation

The authors propose a principled dataset-level weighting method that uses cross-domain influence scores to optimize data mixture ratios during pre-training. This approach enables strong reasoning generalization on held-out benchmarks without exposing them during training or data mixture optimization.

10 retrieved papers

Data–model co-evolution strategy for mid-training

The authors introduce an iterative strategy where the model trained on a given data mixture computes influence scores for samples, which are then used to dynamically remove negative influence samples and adjust data sampling ratios for the next phase. This process converges when most samples reach zero or negative influence, indicating the dataset's information has been largely exhausted.

9 retrieved papers
Can Refute

X-LLM-R1 series of sub-billion-parameter reasoning models with complete open training recipe

The authors develop X-LLM-R1, a series of sub-billion-parameter reasoning models trained on only 4.2T tokens (11.7% of Qwen3's 36T tokens) that achieve state-of-the-art results among fully open-source models. They release the complete training recipe, data sources, data mixing ratios, and model checkpoints to enable reproducibility.

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Benchmark-free, self-evolving data optimization for pre-training data curation

The authors propose a principled dataset-level weighting method that uses cross-domain influence scores to optimize data mixture ratios during pre-training. This approach enables strong reasoning generalization on held-out benchmarks without exposing them during training or data mixture optimization.
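The dataset-level weighting described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' actual estimator: it presumes influence scores have already been computed as a (source domain × proxy domain) matrix, and the softmax form, the `temperature` parameter, and the name `mixture_weights` are all hypothetical choices for this sketch.

```python
import math

def mixture_weights(influence, temperature=1.0):
    """Turn per-domain influence scores into sampling ratios for the
    pre-training mixture.

    `influence` is a list of rows, one per source domain; each row holds
    that domain's influence score on each proxy domain (positive = helpful,
    negative = harmful). Ratios are a softmax over the per-domain mean
    influence, so helpful domains are up-sampled and harmful ones shrink.
    """
    means = [sum(row) / len(row) for row in influence]
    peak = max(means)  # shift logits for numerical stability
    exps = [math.exp((m - peak) / temperature) for m in means]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 3 source domains scored against 2 held-in proxy domains.
scores = [
    [0.8, 0.6],    # math-heavy web text: strongly positive influence
    [0.1, 0.2],    # generic web text: mildly positive
    [-0.4, -0.3],  # low-quality pages: negative influence
]
w = mixture_weights(scores)
print([round(x, 3) for x in w])  # largest ratio goes to the helpful domain
```

Because the proxy domains here are held-in training data rather than evaluation benchmarks, no benchmark is exposed while the mixture ratios are optimized, which is the benchmark-free constraint this contribution claims.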

Contribution

Data–model co-evolution strategy for mid-training

The authors introduce an iterative strategy where the model trained on a given data mixture computes influence scores for samples, which are then used to dynamically remove negative influence samples and adjust data sampling ratios for the next phase. This process converges when most samples reach zero or negative influence, indicating the dataset's information has been largely exhausted.
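The iterative loop described above can be sketched as follows. This is a toy sketch, not the paper's implementation: `co_evolve`, `toy_influence`, the 0.25-per-round decay, and the 5% stopping fraction are illustrative assumptions standing in for real influence estimation during mid-training.

```python
def co_evolve(samples, influence_fn, max_rounds=10, stop_frac=0.05):
    """Iterative data-model co-evolution sketch.

    Each round: (1) score every remaining sample with the current model's
    influence function, (2) drop samples whose influence is zero or
    negative, (3) stop once almost no sample retains positive influence,
    i.e. the dataset's useful information has been largely exhausted.
    """
    for round_idx in range(max_rounds):
        scored = [(s, influence_fn(s, round_idx)) for s in samples]
        positive = [s for s, v in scored if v > 0]
        if len(positive) <= stop_frac * len(samples):
            break                  # convergence: influence exhausted
        samples = positive         # next phase trains only on helpful data
    return samples

# Toy stand-in: a sample's "influence" is its quality score, decaying each
# round as the model absorbs what the sample has to teach.
def toy_influence(sample, round_idx):
    return sample - 0.25 * round_idx

pool = [i / 1000 for i in range(1000)]  # synthetic quality scores in [0, 1)
survivors = co_evolve(pool, toy_influence)
print(len(survivors))  # the pool shrinks round by round
```

Each pass keeps a smaller, higher-quality pool for the next training phase; the loop halts once almost no remaining sample has positive influence, mirroring the convergence criterion that the dataset's information has been largely exhausted.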

Contribution

X-LLM-R1 series of sub-billion-parameter reasoning models with complete open training recipe

The authors develop X-LLM-R1, a series of sub-billion-parameter reasoning models trained on only 4.2T tokens (11.7% of Qwen3's 36T tokens) that achieve state-of-the-art results among fully open-source models. They release the complete training recipe, data sources, data mixing ratios, and model checkpoints to enable reproducibility.