Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Large Language Models, Reasoning, Reinforcement Learning with Verifiable Rewards, Long Chain-of-Thought
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce Thinking-Free Policy Initialization (TFPI), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple ThinkFree operation that explicitly discards the thinking content by directly appending </think>, reducing token usage during inference. Training with ThinkFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks show that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using fewer than 4K H20 hours.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Thinking-Free Policy Initialization (TFPI), a method that bridges long Chain-of-Thought distillation and standard reinforcement learning with verifiable rewards. It sits in the 'Policy Initialization and Transfer Learning' leaf of the taxonomy, which currently contains only this paper, with no siblings. This placement reflects a relatively sparse research direction focused on using initialization strategies and transfer learning to reduce training costs for reasoning models, distinct from the more crowded areas of policy optimization algorithms or CoT pruning techniques.

The taxonomy reveals that TFPI's leaf is nested within 'Efficiency Optimization for Long CoT Reasoning', which includes three other leaves: CoT Length Control and Pruning (three papers on inference-time optimization), Training-Free and Activation-Based Methods (one paper on prompting techniques), and General Efficiency Surveys. Neighboring branches address RL training methods (policy optimization, reward modeling, data synthesis) and domain-specific applications. TFPI diverges from pruning-focused neighbors by targeting training efficiency through initialization rather than inference-time token reduction, and from training-free methods by requiring RL fine-tuning rather than pure prompting.

Among the thirty candidates examined, the ThinkFree operation shows overlap with three papers addressing similar inference-efficiency goals, while token-efficient reasoning without specialized rewards encounters four potentially overlapping works. The TFPI method itself appears more distinctive, with zero clear refutations among the ten candidates examined. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. Contribution one (the ThinkFree operation) and contribution three (token efficiency) face more substantial prior work, while contribution two (the TFPI initialization strategy) appears less directly anticipated in the examined literature.

Based on the thirty candidates examined, TFPI occupies a relatively underexplored niche at the intersection of policy initialization and efficiency optimization. The taxonomy structure suggests this is a sparse research direction compared to more crowded areas like policy optimization or reward modeling. However, the limited search scope and the presence of overlapping work for two of three contributions indicate that claims of novelty should be tempered by acknowledgment of what the top-K semantic search may not have captured.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 7

Research Landscape Overview

Core task: efficient reinforcement learning for long chain-of-thought reasoning models. The field is organized around four main branches that capture distinct but interrelated challenges. Reinforcement Learning Training Methods for CoT Reasoning encompasses work on policy optimization, reward shaping, and training algorithms tailored to multi-step reasoning, including approaches like WizardMath[12] and ARES[11] that refine how RL signals guide reasoning chains. Efficiency Optimization for Long CoT Reasoning addresses computational bottlenecks through techniques such as pruning, distillation, and policy initialization strategies, with studies like O1-Pruner[32] and ThinkPrune[40] reducing inference costs. Domain-Specific and Multimodal CoT Reasoning extends reasoning capabilities to specialized tasks and modalities, as seen in Vision Language CoT[3], Reinforced MLLM[4], and domain applications like Time Series CoT[42]. Reasoning Mechanisms and Theoretical Foundations explores the underlying principles of how models generate and evaluate reasoning steps, with works such as Demystifying Long CoT[1] and Large Reasoning Models[2] investigating what makes extended reasoning effective.

A particularly active line of work focuses on balancing reasoning quality with computational efficiency, where methods must decide when extended thinking justifies its cost and when compact solutions suffice, as explored in When More Less[7]. Another contrast emerges between training-time interventions like ProRL[9] and inference-time optimizations such as Multi-Step Search[5].

Thinking-Free Policy[0] sits within the Efficiency Optimization branch under Policy Initialization and Transfer Learning, addressing how to bootstrap effective reasoning policies without exhaustive search during training. Its emphasis on transfer learning distinguishes it from pruning-focused neighbors like O1-Pruner[32] and from search-based methods like Policy-Guided Path[6], instead exploring how pre-trained knowledge can accelerate the acquisition of long-horizon reasoning capabilities while maintaining computational tractability.

Claimed Contributions

ThinkFree operation for efficient inference and training

The authors introduce a ThinkFree operation that explicitly discards thinking content by appending </think> directly to queries. This operation reduces token usage during inference by over 70% and, when used during training, improves performance and lowers token consumption even when models are evaluated in the original slow-thinking mode.

10 retrieved papers
Can Refute
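The ThinkFree adaptation described above can be sketched as a small prompt transformation. The chat-template markers below are illustrative stand-ins, not the paper's exact format: the key idea is that the assistant turn opens with an already-closed reasoning block, so decoding starts directly at the final answer.

```python
def think_free(query: str) -> str:
    """Adapt a query into thinking-free form by opening the assistant
    turn and immediately closing the reasoning block, so the model
    starts decoding the final answer directly.

    The <|user|>/<|assistant|> markers are hypothetical placeholders
    for whatever chat template the target model uses.
    """
    return f"<|user|>\n{query}\n<|assistant|>\n<think>\n\n</think>\n"


# Example: the rollout prompt already contains a closed (empty)
# reasoning block, so no long chain-of-thought is sampled.
prompt = think_free("What is 17 * 24?")
```

In practice the empty `<think>…</think>` block would be appended via the tokenizer's chat template rather than raw string formatting, but the effect is the same: generation begins past the thinking segment.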
Thinking-Free Policy Initialization (TFPI) method

TFPI is a dedicated training stage that precedes standard RLVR for SFT-distilled large reasoning models. It applies the ThinkFree operation during rollout to reduce training context requirements while maintaining or improving reasoning capabilities. TFPI serves as an efficient foundation that accelerates subsequent RL convergence and transfers across domains.

10 retrieved papers
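A minimal sketch of how such a stage might slot in before standard RLVR, assuming a two-stage schedule. The stage names, token budgets, and helper functions below are illustrative assumptions, not the paper's actual configuration; the verifier and policy update are omitted, leaving only the per-stage prompt adaptation that distinguishes TFPI rollouts from ordinary ones.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    max_rollout_tokens: int      # context budget during sampling
    adapt: Callable[[str], str]  # prompt adaptation applied at rollout


def think_free(query: str) -> str:
    # Close the reasoning block up front so rollouts skip the long CoT.
    return query + "\n<think>\n\n</think>\n"


# Hypothetical schedule: a cheap thinking-free initialization stage with
# a small rollout budget, then standard RLVR on unmodified prompts with
# the full budget.  Budgets are placeholders, not the paper's settings.
schedule = [
    Stage("tfpi", max_rollout_tokens=4_096, adapt=think_free),
    Stage("rlvr", max_rollout_tokens=32_768, adapt=lambda q: q),
]


def rollout_prompts(stage: Stage, queries: List[str]) -> List[str]:
    """Stand-in for the sampling side of one RL step: each query is
    adapted per-stage before generation."""
    return [stage.adapt(q) for q in queries]
```

The point of the sketch is that nothing else in the RL loop changes between stages; only the rollout prompts (and the context budget they permit) differ, which is what keeps the initialization stage cheap.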
Token-efficient reasoning models without specialized rewards

The authors demonstrate that TFPI enables training of reasoning models that achieve both higher accuracy and greater token efficiency compared to direct RL approaches. This is accomplished without requiring specialized length-based reward shaping or complex training strategies, offering a simpler alternative paradigm for building efficient large reasoning models.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ThinkFree operation for efficient inference and training


Contribution

Thinking-Free Policy Initialization (TFPI) method


Contribution

Token-efficient reasoning models without specialized rewards
