Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, reasoning potential, long chain of thought, reasoning pattern, challenging mathematical reasoning
Abstract:

Recent progress in large reasoning models on challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often use CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define a foundation model's reasoning potential, for the first time, as the reciprocal of the number of independent attempts required to answer a question correctly; this quantity is strongly correlated with final model performance. We then propose using diverse data enriched with high-value reasoning patterns to expand this potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capability, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm over chains of reasoning patterns and token entropy that efficiently selects high-value CoT data (CoTP) from the data pool aligned with the core set, thereby training models to master reasoning effectively. With only 10B tokens of CoTP data, an 85A6B Mixture-of-Experts (MoE) model improves by 9.58% on the challenging AIME 2024 and 2025 benchmarks and raises the upper bound of downstream RL performance by 7.81%.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a framework for selecting high-value chain-of-thought data to enhance mathematical reasoning in foundation models. It sits within the 'Data Synthesis and Selection' leaf of the taxonomy, which contains three papers total. This leaf addresses techniques for generating or filtering training data to improve reasoning capabilities, distinguishing itself from adjacent leaves focused on preference learning or supervised fine-tuning with fixed datasets. The relatively small number of sibling papers suggests this is a moderately explored but not overcrowded research direction within the broader training optimization landscape.

The taxonomy reveals that this work connects to several neighboring research areas. The sibling papers Diverse Chains Thought and BoostStep both tackle data curation but emphasize different aspects—diversity of examples versus iterative refinement of supervision signals. Adjacent leaves include 'Preference and Reinforcement Learning' (four papers) and 'Supervised Fine-Tuning Strategies' (three papers), which focus on training algorithms rather than data selection. The scope note clarifies that this leaf excludes training methods using fixed datasets, positioning the paper at the intersection of data engineering and model optimization for reasoning tasks.

Among the 30 candidates examined through semantic search, none were found to clearly refute any of the three contributions. For the theoretical definition of reasoning potential, 10 candidates were reviewed with zero refutable matches. Similarly, the abstraction of atomic reasoning patterns and the dual-granularity selection algorithm each had 10 candidates examined with no clear prior work overlap. This suggests that within the limited search scope, the paper's specific formulations—particularly the inverse-attempts metric for reasoning potential and the pattern-entropy dual criterion—appear relatively novel compared to the retrieved literature.

The analysis indicates that the paper's contributions occupy a distinct position within the examined literature, though the search was constrained to 30 top-K semantic matches. The absence of refutable candidates across all three contributions, combined with the moderately populated taxonomy leaf, suggests the work introduces fresh perspectives on data selection for reasoning. However, the limited search scope means potentially relevant work outside the top-30 semantic neighborhood may not have been captured, and a broader literature review could reveal additional connections or overlaps.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Selecting high-value chain-of-thought data for mathematical reasoning.

The field has organized itself around five major branches that collectively address how to elicit, verify, train, and evaluate reasoning in mathematical domains. Chain-of-Thought Generation and Prompting Techniques explores methods for producing intermediate reasoning steps, ranging from zero-shot prompting strategies like MathPrompter[6] to structured generation approaches such as Chain of Knowledge[1]. Reasoning Verification and Supervision focuses on assessing the correctness of reasoning traces, with works like Verify Step by Step[7] and Automated Process Supervision[8] developing process-level reward models, while GenPRM[10] and ProcessBench[13] refine supervision signals at finer granularities. Training and Optimization for Mathematical Reasoning encompasses data synthesis, selection, and learning algorithms—including preference optimization methods like Full Step DPO[25] and Step KTO[33]—that leverage verified reasoning chains to improve model performance. Search and Inference-Time Optimization investigates test-time strategies such as rStar Math[2] and Q Star[18], which combine tree search with learned value functions to explore solution spaces more effectively. Finally, Evaluation and Analysis of Mathematical Reasoning provides benchmarks like MATH Dataset[24] and Mathverse[4] to measure progress and understand model capabilities.

Within the Training and Optimization branch, a particularly active line of work centers on data synthesis and selection, where the challenge is to identify or generate reasoning traces that maximize learning efficiency. Diverse Chains Thought[0] addresses this by curating varied chain-of-thought examples to enhance training diversity, positioning itself alongside efforts like BoostStep[16] and Jiuzhang[19] that also emphasize strategic data curation for mathematical problem-solving.
While BoostStep[16] focuses on iteratively refining step-level supervision signals and Jiuzhang[19] integrates domain-specific heuristics for Chinese mathematical reasoning, Diverse Chains Thought[0] emphasizes the breadth of reasoning patterns captured in the training set. This contrasts with approaches like Adaptive Reasoning[5], which dynamically adjusts reasoning strategies at inference time rather than curating training data upfront. The interplay between selecting high-quality training examples and designing effective verification mechanisms remains a central open question, as the value of a reasoning trace depends both on its correctness and its pedagogical utility for model learning.

Claimed Contributions

Theoretical definition of reasoning potential in foundation models

The authors introduce a formal definition of reasoning potential as the probability that a model generates the correct answer when sampling, which is inversely related to the expected number of attempts needed to solve a question. This theoretical framework provides a principled way to measure and optimize model reasoning capabilities.

10 retrieved papers

Abstraction of atomic reasoning patterns from CoT sequences

The authors propose extracting atomic reasoning patterns that exhibit commonality and inductive capabilities from chain-of-thought data. These patterns are used to build a core reference set that approximates oracle reasoning data and guides the selection of high-value training samples.
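One simple way to picture this abstraction is keyword-cue labeling of CoT steps. The pattern taxonomy and cues below are entirely hypothetical — the paper's actual atomic patterns and extraction procedure are not reproduced here — but they show how a CoT sequence can be collapsed into a chain of pattern labels:

```python
# Hypothetical pattern taxonomy; these labels and keyword cues are
# illustrative only, not the paper's atomic reasoning patterns.
PATTERN_CUES = {
    "case_split": ("case", "cases", "consider"),
    "substitution": ("substitute", "let"),
    "verification": ("check", "verify"),
    "induction": ("induction", "inductive"),
}

def pattern_chain(cot_steps):
    """Map each reasoning step to an atomic pattern label by keyword cue."""
    chain = []
    for step in cot_steps:
        text = step.lower()
        label = next((name for name, cues in PATTERN_CUES.items()
                      if any(cue in text for cue in cues)), "other")
        chain.append(label)
    return chain
```

A pattern chain like `["substitution", "case_split", "verification"]` is then a compact signature of a CoT trace that can be compared across samples.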

10 retrieved papers

Dual-granularity algorithm for selecting high-value CoT data

The authors develop an algorithm using weighted Dynamic Time Warping that operates at two levels of granularity (reasoning pattern chains and token entropy) to efficiently select long chain-of-thought data from a source pool that matches valuable reasoning patterns in the core set.
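A generic weighted-DTW distance gives the flavor of this comparison. The sketch below assumes a user-supplied pairwise cost `dist` and optional per-step `weights`; how the paper combines pattern-chain similarity with token entropy at the second granularity is not reproduced here:

```python
def weighted_dtw(seq_a, seq_b, dist, weights=None):
    """Weighted dynamic time warping distance between two sequences.

    `dist` is any pairwise cost over elements (e.g. between pattern
    labels); `weights` optionally reweights steps of `seq_a`. Both are
    placeholders for the paper's unspecified cost function.
    """
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # D[i][j] = cost of best alignment of seq_a[:i] with seq_b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        w = 1.0 if weights is None else weights[i - 1]
        for j in range(1, m + 1):
            cost = w * dist(seq_a[i - 1], seq_b[j - 1])
            # Standard DTW recurrence: insertion, deletion, or match.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Candidates from the source pool whose pattern chains have a small warping distance to chains in the core reference set would then be favored for selection.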

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Theoretical definition of reasoning potential in foundation models

The authors introduce a formal definition of reasoning potential as the probability that a model generates the correct answer when sampling, which is inversely related to the expected number of attempts needed to solve a question. This theoretical framework provides a principled way to measure and optimize model reasoning capabilities.

Contribution

Abstraction of atomic reasoning patterns from CoT sequences

The authors propose extracting atomic reasoning patterns that exhibit commonality and inductive capabilities from chain-of-thought data. These patterns are used to build a core reference set that approximates oracle reasoning data and guides the selection of high-value training samples.

Contribution

Dual-granularity algorithm for selecting high-value CoT data

The authors develop an algorithm using weighted Dynamic Time Warping that operates at two levels of granularity (reasoning pattern chains and token entropy) to efficiently select long chain-of-thought data from a source pool that matches valuable reasoning patterns in the core set.