Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, reasoning potential, long chain of thought, reasoning pattern, challenging mathematical reasoning
Abstract:

Recent progress in large reasoning models on challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often use CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define a foundation model's reasoning potential, for the first time, as the reciprocal of the number of independent attempts required to answer a question correctly; this quantity is strongly correlated with final model performance. We then propose using diverse data enriched with high-value reasoning patterns to expand this potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capability, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm over chains of reasoning patterns and token entropy that efficiently selects high-value CoT data (CoTP) from the data pool aligned with the core set, thereby training models to master reasoning effectively. With only 10B tokens of CoTP data, an 85A6B Mixture-of-Experts (MoE) model improves by 9.58% on the challenging AIME 2024 and 2025 benchmarks and raises the upper bound of downstream RL performance by 7.81%.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a framework for selecting high-value chain-of-thought data to enhance mathematical reasoning in foundation models. It sits within the 'Data Synthesis and Selection' leaf of the taxonomy, which contains three papers total. This leaf addresses techniques for generating or filtering training data to improve reasoning capabilities, distinguishing itself from adjacent leaves focused on preference learning or supervised fine-tuning with fixed datasets. The relatively small number of sibling papers suggests this is a moderately explored but not overcrowded research direction within the broader training optimization landscape.

The taxonomy reveals that this work connects to several neighboring research areas. The sibling papers Diverse Chains Thought and BoostStep both tackle data curation but emphasize different aspects—diversity of examples versus iterative refinement of supervision signals. Adjacent leaves include 'Preference and Reinforcement Learning' (four papers) and 'Supervised Fine-Tuning Strategies' (three papers), which focus on training algorithms rather than data selection. The scope note clarifies that this leaf excludes training methods using fixed datasets, positioning the paper at the intersection of data engineering and model optimization for reasoning tasks.

Among the 30 candidates examined through semantic search, none were found to clearly refute any of the three contributions. For the theoretical definition of reasoning potential, 10 candidates were reviewed with zero refutable matches. Similarly, the abstraction of atomic reasoning patterns and the dual-granularity selection algorithm each had 10 candidates examined with no clear prior work overlap. This suggests that within the limited search scope, the paper's specific formulations—particularly the inverse-attempts metric for reasoning potential and the pattern-entropy dual criterion—appear relatively novel compared to the retrieved literature.

The analysis indicates that the paper's contributions occupy a distinct position within the examined literature, though the search was constrained to 30 top-K semantic matches. The absence of refutable candidates across all three contributions, combined with the moderately populated taxonomy leaf, suggests the work introduces fresh perspectives on data selection for reasoning. However, the limited search scope means potentially relevant work outside the top-30 semantic neighborhood may not have been captured, and a broader literature review could reveal additional connections or overlaps.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Selecting high-value chain-of-thought data for mathematical reasoning.

The field has organized itself around five major branches that collectively address how to elicit, verify, train, and evaluate reasoning in mathematical domains. Chain-of-Thought Generation and Prompting Techniques explores methods for producing intermediate reasoning steps, ranging from zero-shot prompting strategies like MathPrompter[6] to structured generation approaches such as Chain of Knowledge[1]. Reasoning Verification and Supervision focuses on assessing the correctness of reasoning traces, with works like Verify Step by Step[7] and Automated Process Supervision[8] developing process-level reward models, while GenPRM[10] and ProcessBench[13] refine supervision signals at finer granularities. Training and Optimization for Mathematical Reasoning encompasses data synthesis, selection, and learning algorithms—including preference optimization methods like Full Step DPO[25] and Step KTO[33]—that leverage verified reasoning chains to improve model performance. Search and Inference-Time Optimization investigates test-time strategies such as rStar Math[2] and Q Star[18], which combine tree search with learned value functions to explore solution spaces more effectively. Finally, Evaluation and Analysis of Mathematical Reasoning provides benchmarks like MATH Dataset[24] and Mathverse[4] to measure progress and understand model capabilities.

Within the Training and Optimization branch, a particularly active line of work centers on data synthesis and selection, where the challenge is to identify or generate reasoning traces that maximize learning efficiency. Diverse Chains Thought[0] addresses this by curating varied chain-of-thought examples to enhance training diversity, positioning itself alongside efforts like BoostStep[16] and Jiuzhang[19] that also emphasize strategic data curation for mathematical problem-solving.
While BoostStep[16] focuses on iteratively refining step-level supervision signals and Jiuzhang[19] integrates domain-specific heuristics for Chinese mathematical reasoning, Diverse Chains Thought[0] emphasizes the breadth of reasoning patterns captured in the training set. This contrasts with approaches like Adaptive Reasoning[5], which dynamically adjusts reasoning strategies at inference time rather than curating training data upfront. The interplay between selecting high-quality training examples and designing effective verification mechanisms remains a central open question, as the value of a reasoning trace depends both on its correctness and its pedagogical utility for model learning.

Claimed Contributions

Theoretical definition of reasoning potential in foundation models

The authors introduce a formal definition of reasoning potential as the probability that a model generates the correct answer when sampling, which is inversely related to the expected number of attempts needed to solve a question. This theoretical framework provides a principled way to measure and optimize model reasoning capabilities.

10 retrieved papers

Abstraction of atomic reasoning patterns from CoT sequences

The authors propose extracting atomic reasoning patterns that exhibit commonality and inductive capabilities from chain-of-thought data. These patterns are used to build a core reference set that approximates oracle reasoning data and guides the selection of high-value training samples.
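One simple way to picture this abstraction is keyword-cue labeling of CoT steps. The pattern taxonomy and cues below are entirely hypothetical — the paper's actual atomic patterns and extraction procedure are not reproduced here — but they show how a CoT sequence can be collapsed into a chain of pattern labels:

```python
# Hypothetical pattern taxonomy; these labels and keyword cues are
# illustrative only, not the paper's atomic reasoning patterns.
PATTERN_CUES = {
    "case_split": ("case", "cases", "consider"),
    "substitution": ("substitute", "let"),
    "verification": ("check", "verify"),
    "induction": ("induction", "inductive"),
}

def pattern_chain(cot_steps):
    """Map each reasoning step to an atomic pattern label by keyword cue."""
    chain = []
    for step in cot_steps:
        text = step.lower()
        label = next((name for name, cues in PATTERN_CUES.items()
                      if any(cue in text for cue in cues)), "other")
        chain.append(label)
    return chain
```

A pattern chain like `["substitution", "case_split", "verification"]` is then a compact signature of a CoT trace that can be compared across samples.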

10 retrieved papers

Dual-granularity algorithm for selecting high-value CoT data

The authors develop an algorithm using weighted Dynamic Time Warping that operates at two levels of granularity (reasoning pattern chains and token entropy) to efficiently select long chain-of-thought data from a source pool that matches valuable reasoning patterns in the core set.
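A generic weighted-DTW distance gives the flavor of this comparison. The sketch below assumes a user-supplied pairwise cost `dist` and optional per-step `weights`; how the paper combines pattern-chain similarity with token entropy at the second granularity is not reproduced here:

```python
def weighted_dtw(seq_a, seq_b, dist, weights=None):
    """Weighted dynamic time warping distance between two sequences.

    `dist` is any pairwise cost over elements (e.g. between pattern
    labels); `weights` optionally reweights steps of `seq_a`. Both are
    placeholders for the paper's unspecified cost function.
    """
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # D[i][j] = cost of best alignment of seq_a[:i] with seq_b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        w = 1.0 if weights is None else weights[i - 1]
        for j in range(1, m + 1):
            cost = w * dist(seq_a[i - 1], seq_b[j - 1])
            # Standard DTW recurrence: insertion, deletion, or match.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Candidates from the source pool whose pattern chains have a small warping distance to chains in the core reference set would then be favored for selection.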

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Theoretical definition of reasoning potential in foundation models

The authors introduce a formal definition of reasoning potential as the probability that a model generates the correct answer when sampling, which is inversely related to the expected number of attempts needed to solve a question. This theoretical framework provides a principled way to measure and optimize model reasoning capabilities.

Contribution

Abstraction of atomic reasoning patterns from CoT sequences

The authors propose extracting atomic reasoning patterns that exhibit commonality and inductive capabilities from chain-of-thought data. These patterns are used to build a core reference set that approximates oracle reasoning data and guides the selection of high-value training samples.

Contribution

Dual-granularity algorithm for selecting high-value CoT data

The authors develop an algorithm using weighted Dynamic Time Warping that operates at two levels of granularity (reasoning pattern chains and token entropy) to efficiently select long chain-of-thought data from a source pool that matches valuable reasoning patterns in the core set.