Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Large Language Models, Reasoning, Reinforcement Learning with Verifiable Rewards, Long Chain-of-Thought
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce Thinking-Free Policy Initialization (TFPI), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple ThinkFree operation that explicitly discards the thinking content by directly appending </think>, reducing token usage during inference. Training with ThinkFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks show that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using fewer than 4K H20 hours.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Thinking-Free Policy Initialization (TFPI), a method that bridges long Chain-of-Thought distillation and standard reinforcement learning with verifiable rewards. It sits in the 'Policy Initialization and Transfer Learning' leaf of the taxonomy, which currently contains only this paper, with no siblings. This placement reflects a relatively sparse research direction focused on using initialization strategies and transfer learning to reduce training costs for reasoning models, distinct from the more crowded areas of policy optimization algorithms or CoT pruning techniques.

The taxonomy reveals that TFPI's leaf is nested within 'Efficiency Optimization for Long CoT Reasoning', which includes three other leaves: CoT Length Control and Pruning (three papers on inference-time optimization), Training-Free and Activation-Based Methods (one paper on prompting techniques), and General Efficiency Surveys. Neighboring branches address RL training methods (policy optimization, reward modeling, data synthesis) and domain-specific applications. TFPI diverges from pruning-focused neighbors by targeting training efficiency through initialization rather than inference-time token reduction, and from training-free methods by requiring RL fine-tuning rather than pure prompting.

Among the thirty candidates examined, the ThinkFree operation shows overlap with three papers addressing similar inference-efficiency goals, while token-efficient reasoning without specialized rewards encounters four potentially overlapping works. The TFPI method itself appears more distinctive, with zero clear refutations among the ten candidates examined. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. Contribution one (the ThinkFree operation) and contribution three (token efficiency) face more substantial prior work, while contribution two (the TFPI initialization strategy) appears less directly anticipated in the examined literature.

Based on the thirty candidates examined, TFPI occupies a relatively underexplored niche at the intersection of policy initialization and efficiency optimization. The taxonomy structure suggests this is a sparse research direction compared to more crowded areas like policy optimization or reward modeling. However, the limited search scope and the presence of overlapping work for two of three contributions indicate that claims of novelty should be tempered by acknowledgment of what the top-K semantic search may not have captured.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 7

Research Landscape Overview

Core task: efficient reinforcement learning for long chain-of-thought reasoning models. The field is organized around four main branches that capture distinct but interrelated challenges. Reinforcement Learning Training Methods for CoT Reasoning encompasses work on policy optimization, reward shaping, and training algorithms tailored to multi-step reasoning, including approaches like WizardMath[12] and ARES[11] that refine how RL signals guide reasoning chains. Efficiency Optimization for Long CoT Reasoning addresses computational bottlenecks through techniques such as pruning, distillation, and policy initialization strategies, with studies like O1-Pruner[32] and ThinkPrune[40] reducing inference costs. Domain-Specific and Multimodal CoT Reasoning extends reasoning capabilities to specialized tasks and modalities, as seen in Vision Language CoT[3], Reinforced MLLM[4], and domain applications like Time Series CoT[42]. Reasoning Mechanisms and Theoretical Foundations explores the underlying principles of how models generate and evaluate reasoning steps, with works such as Demystifying Long CoT[1] and Large Reasoning Models[2] investigating what makes extended reasoning effective.

A particularly active line of work focuses on balancing reasoning quality with computational efficiency, where methods must decide when extended thinking justifies its cost and when compact solutions suffice, as explored in When More Less[7]. Another contrast emerges between training-time interventions like ProRL[9] and inference-time optimizations such as Multi-Step Search[5].

Thinking-Free Policy[0] sits within the Efficiency Optimization branch under Policy Initialization and Transfer Learning, addressing how to bootstrap effective reasoning policies without exhaustive search during training. Its emphasis on transfer learning distinguishes it from pruning-focused neighbors like O1-Pruner[32] and from search-based methods like Policy-Guided Path[6], instead exploring how pre-trained knowledge can accelerate the acquisition of long-horizon reasoning capabilities while maintaining computational tractability.

Claimed Contributions

ThinkFree operation for efficient inference and training

The authors introduce a ThinkFree operation that explicitly discards thinking content by appending </think> directly to queries. This operation reduces token usage during inference by over 70% and, when used during training, improves performance and lowers token consumption even when models are evaluated in the original slow-thinking mode.

10 retrieved papers
Can Refute
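The ThinkFree adaptation described above can be sketched as a small prompt transformation. The chat-template markers below are illustrative stand-ins, not the paper's exact format: the key idea is that the assistant turn opens with an already-closed reasoning block, so decoding starts directly at the final answer.

```python
def think_free(query: str) -> str:
    """Adapt a query into thinking-free form by opening the assistant
    turn and immediately closing the reasoning block, so the model
    starts decoding the final answer directly.

    The <|user|>/<|assistant|> markers are hypothetical placeholders
    for whatever chat template the target model uses.
    """
    return f"<|user|>\n{query}\n<|assistant|>\n<think>\n\n</think>\n"


# Example: the rollout prompt already contains a closed (empty)
# reasoning block, so no long chain-of-thought is sampled.
prompt = think_free("What is 17 * 24?")
```

In practice the empty `<think>…</think>` block would be appended via the tokenizer's chat template rather than raw string formatting, but the effect is the same: generation begins past the thinking segment.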
Thinking-Free Policy Initialization (TFPI) method

TFPI is a dedicated training stage that precedes standard RLVR for SFT-distilled large reasoning models. It applies the ThinkFree operation during rollout to reduce training context requirements while maintaining or improving reasoning capabilities. TFPI serves as an efficient foundation that accelerates subsequent RL convergence and transfers across domains.

10 retrieved papers
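A minimal sketch of how such a stage might slot in before standard RLVR, assuming a two-stage schedule. The stage names, token budgets, and helper functions below are illustrative assumptions, not the paper's actual configuration; the verifier and policy update are omitted, leaving only the per-stage prompt adaptation that distinguishes TFPI rollouts from ordinary ones.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    max_rollout_tokens: int      # context budget during sampling
    adapt: Callable[[str], str]  # prompt adaptation applied at rollout


def think_free(query: str) -> str:
    # Close the reasoning block up front so rollouts skip the long CoT.
    return query + "\n<think>\n\n</think>\n"


# Hypothetical schedule: a cheap thinking-free initialization stage with
# a small rollout budget, then standard RLVR on unmodified prompts with
# the full budget.  Budgets are placeholders, not the paper's settings.
schedule = [
    Stage("tfpi", max_rollout_tokens=4_096, adapt=think_free),
    Stage("rlvr", max_rollout_tokens=32_768, adapt=lambda q: q),
]


def rollout_prompts(stage: Stage, queries: List[str]) -> List[str]:
    """Stand-in for the sampling side of one RL step: each query is
    adapted per-stage before generation."""
    return [stage.adapt(q) for q in queries]
```

The point of the sketch is that nothing else in the RL loop changes between stages; only the rollout prompts (and the context budget they permit) differ, which is what keeps the initialization stage cheap.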
Token-efficient reasoning models without specialized rewards

The authors demonstrate that TFPI enables training of reasoning models that achieve both higher accuracy and greater token efficiency compared to direct RL approaches. This is accomplished without requiring specialized length-based reward shaping or complex training strategies, offering a simpler alternative paradigm for building efficient large reasoning models.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ThinkFree operation for efficient inference and training


Contribution

Thinking-Free Policy Initialization (TFPI) method


Contribution

Token-efficient reasoning models without specialized rewards
