Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes
Overview
Overall Novelty Assessment
The paper proposes X-LLM-R1, a series of sub-billion-parameter reasoning models trained on 4.2T tokens of curated data, challenging the assumption that reasoning emergence requires datasets exceeding 10T tokens. It resides in the Data-Centric Training Strategies leaf under Independent Training Approaches for Small Models, alongside two sibling papers: Sub-Billion Reasoners and Solution Guidance. This leaf represents a focused research direction within the broader taxonomy of 50 papers across 36 topics, emphasizing data curation and resampling over distillation from large teacher models. The concentration of only three papers in this leaf suggests a relatively sparse but emerging area of investigation.
The taxonomy reveals that Data-Centric Training Strategies sits within a larger branch of Independent Training Approaches, which also includes Reinforcement Learning for Reasoning and Architectural Innovations for Reasoning. Neighboring branches include Knowledge Distillation from Large to Small Models, which contains five papers on Chain-of-Thought Distillation and four on Specialized Knowledge Distillation, representing a more crowded research direction. The scope note for Data-Centric Training explicitly excludes reinforcement learning and architectural modifications, positioning this work as focused on training data quality rather than algorithmic or structural innovations. This placement suggests the paper diverges from the dominant distillation paradigm toward autonomous data optimization.
Among 29 candidates examined, the analysis identified potential prior-work overlap for two of the three contributions. The data–model co-evolution strategy for mid-training was checked against 9 candidates with 1 refutable match, and the X-LLM-R1 model series against 10 candidates with 1 refutable match. The benchmark-free, self-evolving data optimization contribution was checked against 10 candidates with no clear refutations. Given the limited search scope of roughly 30 papers, these statistics suggest the data optimization methodology is the most distinctive contribution, while the co-evolution strategy and model release have closer precedents in the examined literature. A search at this scale does not constitute exhaustive coverage of the field.
Based on top-30 semantic matches and citation expansion, the work appears to occupy a relatively underexplored niche within data-centric training for small reasoning models. The taxonomy structure indicates that while distillation-based approaches dominate the broader field, independent training strategies remain less densely populated. The limited search scope means that additional relevant work may exist beyond the examined candidates, particularly in adjacent areas like synthetic data generation or curriculum learning for reasoning tasks.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a principled dataset-level weighting method that uses cross-domain influence scores to optimize data mixture ratios during pre-training. This approach enables strong reasoning generalization on held-out benchmarks without exposing them during training or data mixture optimization.
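To make the dataset-level weighting idea concrete, the following is a minimal sketch of how cross-domain influence scores could drive mixture-ratio updates. The influence matrix, the softmax-style adjustment, and the `temperature` parameter are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def reweight_mixture(influence: np.ndarray, prior: np.ndarray,
                     temperature: float = 1.0) -> np.ndarray:
    """Hypothetical dataset-level reweighting from cross-domain influence.

    influence[i, j] is assumed to estimate how much training on domain i
    helps held-out performance on domain j; the held-out benchmarks
    themselves are never part of the training mixture.
    """
    # Aggregate each source domain's benefit across all target domains.
    benefit = influence.sum(axis=1)
    # Softmax-scaled adjustment of the prior mixture; temperature controls
    # how aggressively weight shifts toward high-influence domains.
    adjustment = np.exp(benefit / temperature)
    weights = prior * adjustment
    return weights / weights.sum()

# Toy example: three domains, uniform prior.
influence = np.array([[0.5, 0.2, -0.1],
                      [0.1, 0.4,  0.0],
                      [-0.2, 0.0, 0.3]])
prior = np.ones(3) / 3
w = reweight_mixture(influence, prior)
```

Raising the temperature keeps the result close to the prior mixture; lowering it concentrates weight on the domains with the largest aggregate influence.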
The authors introduce an iterative strategy where the model trained on a given data mixture computes influence scores for samples, which are then used to dynamically remove negative influence samples and adjust data sampling ratios for the next phase. This process converges when most samples reach zero or negative influence, indicating the dataset's information has been largely exhausted.
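The co-evolution loop described above can be sketched as follows. The `train_fn` and `score_fn` callables and the `stop_frac` convergence threshold are placeholders standing in for the paper's training and influence-estimation procedures, not its actual implementation.

```python
def co_evolve(pool, train_fn, score_fn, max_phases=10, stop_frac=0.9):
    """Hypothetical data-model co-evolution loop.

    Each phase trains on the current pool, scores every sample with a
    stand-in influence estimate, and drops negative-influence samples
    before the next phase.
    """
    model = None
    for _ in range(max_phases):
        model = train_fn(model, pool)
        scores = {s: score_fn(model, s) for s in pool}
        exhausted = sum(1 for v in scores.values() if v <= 0)
        # Convergence: stop once most samples contribute zero or negative
        # influence, i.e. the dataset's information is largely exhausted.
        if exhausted / len(pool) >= stop_frac:
            break
        # Keep only positive-influence samples for the next phase.
        pool = [s for s in pool if scores[s] > 0]
    return model, pool

# Toy demo with stand-in functions: the "model" just counts phases, and a
# sample's influence decays by one per completed phase.
data = list(range(100))
def toy_train(model, pool):
    return (model or 0) + 1
def toy_score(model, s):
    return (s % 10) - model
model, kept = co_evolve(data, toy_train, toy_score)
```

In the toy run the pool shrinks each phase until every remaining sample has non-positive influence, at which point the loop halts rather than resampling further.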
The authors develop X-LLM-R1, a series of sub-billion-parameter reasoning models trained on only 4.2T tokens (11.7% of Qwen3's 36T tokens) that achieve state-of-the-art results among fully open-source models. They release the complete training recipe, data sources, data mixing ratios, and model checkpoints to enable reproducibility.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Enhancing the reasoning capabilities of small language models via solution guidance fine-tuning PDF
[35] Inducing thinking capabilities in Large Language Models (LLM) by Synthetic dataset PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Benchmark-free, self-evolving data optimization for pre-training data curation
The authors propose a principled dataset-level weighting method that uses cross-domain influence scores to optimize data mixture ratios during pre-training. This approach enables strong reasoning generalization on held-out benchmarks without exposing them during training or data mixture optimization.
[51] TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training PDF
[52] Data-efficient pretraining with group-level data influence modeling PDF
[53] Cross-domain Learning Framework for Book-Movie Recommendation with RoBERTa and DistilBERT in Action PDF
[54] Automatic Instruction Data Selection for Large Language Models via Uncertainty-Aware Influence Maximization PDF
[55] Mates: Model-aware data selection for efficient pretraining with data influence models PDF
[56] InfFeed: Influence Functions as a Feedback to Improve the Performance of Subjective Tasks PDF
[57] Influence scores at scale for efficient language data sampling PDF
[58] Farewell to aimless large-scale pretraining: Influential subset selection for language model PDF
[59] JI2S: Joint Influence-Aware Instruction Data Selection for Efficient Fine-Tuning PDF
[60] Layer-Aware Influence for Online Data Valuation Estimation PDF
Data–model co-evolution strategy for mid-training
The authors introduce an iterative strategy where the model trained on a given data mixture computes influence scores for samples, which are then used to dynamically remove negative influence samples and adjust data sampling ratios for the next phase. This process converges when most samples reach zero or negative influence, indicating the dataset's information has been largely exhausted.
[70] Climb: Clustering-based iterative data mixture bootstrapping for language model pre-training PDF
[71] DynImpt: A Dynamic Data Selection Method for Improving Model Training Efficiency PDF
[72] Bidirectional Curriculum Learning: Decelerating and Re-accelerating Learning for Robust Convergence PDF
[73] Cost-Effective Incremental Deep Model: Matching Model Capacity With the Least Sampling PDF
[74] Reinforcement Mid-Training PDF
[75] Natural gradient evolution strategies for adaptive sampling PDF
[76] A Data-Centric Perspective on the Lifecycle of Large Language Models PDF
[77] Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning PDF
[78] Nemotron-CLIMB: Clustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training PDF
X-LLM-R1 series of sub-billion-parameter reasoning models with complete open training recipe
The authors develop X-LLM-R1, a series of sub-billion-parameter reasoning models trained on only 4.2T tokens (11.7% of Qwen3's 36T tokens) that achieve state-of-the-art results among fully open-source models. They release the complete training recipe, data sources, data mixing ratios, and model checkpoints to enable reproducibility.