Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models; Reasoning; Reinforcement Learning; Supervised Fine-Tuning
Abstract:

Recent advances in large language model (LLM) reasoning have shown that reasoning ability can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce \textbf{ReLIFT} (\textbf{Re}inforcement \textbf{L}earning \textbf{I}nterleaved with Online \textbf{F}ine-\textbf{T}uning), a novel training strategy. ReLIFT employs RL for general training, but interleaves it with targeted SFT on challenging questions for which high-quality solutions are collected online. By alternating between RL and SFT, ReLIFT addresses model weaknesses as they emerge. Empirically, ReLIFT outperforms previous RLVR methods by an average of +6.7 points across a suite of six benchmarks (five math reasoning and one out-of-distribution). More importantly, ReLIFT surpasses baselines such as individual RL, individual SFT, and various hybrid approaches while reducing the required training time. These results provide compelling evidence that ReLIFT is a powerful and resource-efficient paradigm for developing capable reasoning models. The code is available at \href{https://anonymous.4open.science/r/Learning-What-Reinforcement-Learning-Can-t-6AFF/}{here}.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ReLIFT, a training strategy that interleaves reinforcement learning with targeted supervised fine-tuning on challenging questions identified during training. Within the taxonomy, it occupies the 'Interleaved Online Fine-Tuning' leaf under 'Hybrid Training Paradigms and Integration Strategies'. Notably, this leaf contains only the original paper itself—no sibling papers are present. This isolation suggests the specific formulation of online interleaving with difficulty-based targeting represents a relatively unexplored niche within the broader hybrid training landscape.

The taxonomy reveals a crowded parent branch ('Hybrid Training Paradigms') with neighboring leaves addressing adaptive alternation, sequential pipelines, and cooperative optimization. These directions collectively contain nine papers exploring various RL-SFT integration strategies. The paper's approach diverges from fixed-order sequential methods and static adaptive weighting by emphasizing online data collection during RL training. The taxonomy's scope and exclude notes clarify that offline SFT-then-RL pipelines belong elsewhere, positioning this work at the boundary between dynamic adaptation and online learning paradigms.

Among the thirty candidates examined, the analysis identifies limited overlap with prior work. The systematic training-dynamics analysis (Contribution A) was compared against ten candidates, one of which appears to refute it; the ReLIFT framework (Contribution B) was compared against ten candidates, three of which are potential refutations. The performance claims (Contribution C) found no refuting candidates among the ten examined. These statistics suggest that while the core interleaving mechanism has some precedent within the limited search scope, the specific combination of difficulty-based targeting and online SFT integration appears less saturated. The modest refutation counts reflect the constrained search scale rather than exhaustive coverage.

Given the top-thirty semantic search scope, the work appears to occupy a sparsely populated research direction within a well-studied parent domain. The single-paper leaf and limited refutations suggest novelty in the specific formulation, though the broader hybrid training paradigm is actively explored. The analysis does not cover potential work outside the semantic neighborhood or recent preprints, leaving open questions about concurrent developments in online interleaving strategies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: combining reinforcement learning with supervised fine-tuning for language model reasoning. The field has evolved into a rich landscape organized around nine major branches. Hybrid Training Paradigms and Integration Strategies explore how to blend RL and SFT signals, ranging from interleaved schedules to cooperative frameworks that balance exploration and imitation. Reinforcement Learning Mechanisms and Reward Design focuses on outcome and process rewards, while Supervised Learning Approaches and Enhancements refines distillation and synthetic data generation. Self-Improvement and Correction Mechanisms enables models to iteratively refine their outputs, and Inference-Time Reasoning and Test-Time Scaling investigates how models can deliberate more effectively at deployment. Domain-Specific Applications and Adaptations tailors these methods to mathematics, coding, and specialized fields, while Multi-Agent and Collaborative Frameworks coordinates multiple reasoning agents. Analysis, Evaluation, and Theoretical Foundations provides empirical insights and formal understanding, and Auxiliary Methods and Supporting Techniques supplies tools such as retrieval augmentation and curriculum learning.

Recent work reveals contrasting philosophies in how RL and SFT should interact. Some studies argue for tight integration in which RL squeezes the policy while SFT expands it, as explored in RL vs SL Refactoring[4] and RL Squeezes SFT Expands[35], while others advocate cooperative or step-wise adaptive blending, exemplified by Cooperative SFT RL[43] and Step-wise Adaptive Integration[17]. Interleaved Fine-Tuning[0] sits within the Hybrid Training Paradigms branch, emphasizing online interleaving of RL and SFT updates to maintain stability and leverage fresh exploration. This approach contrasts with works such as SuperRL[1] and ReFT[2], which may prioritize different sequencing or reward structures, and complements methods such as ARES[3] that focus on self-correction during training.
The central tension across these lines involves balancing sample efficiency, stability, and the ability to generalize reasoning patterns beyond narrow task distributions.

Claimed Contributions

Systematic analysis of RL and SFT training dynamics across question difficulty levels

The authors analyze how reinforcement learning and supervised fine-tuning affect model accuracy on questions of varying difficulty (Easy, Medium, Hard, Hardest). They find that RL excels at improving performance on questions within the model's existing capabilities, while SFT is more effective for enabling progress on questions beyond the model's current scope.

10 retrieved papers
Can Refute
ReLIFT training framework

The authors introduce Reinforcement Learning Interleaved with Online Fine-Tuning (ReLIFT), a training strategy that combines RL for general training with targeted supervised fine-tuning on challenging questions. The method dynamically collects high-quality solutions for hard problems during RL rollouts and performs fine-tuning steps when sufficient examples accumulate.

10 retrieved papers
Can Refute
State-of-the-art performance with reduced resource requirements

The authors demonstrate that ReLIFT achieves superior performance across mathematical reasoning and out-of-distribution benchmarks compared to pure RL, pure SFT, and various hybrid approaches. The method requires less demonstration data and training time while producing more concise solutions.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic analysis of RL and SFT training dynamics across question difficulty levels

The authors analyze how reinforcement learning and supervised fine-tuning affect model accuracy on questions of varying difficulty (Easy, Medium, Hard, Hardest). They find that RL excels at improving performance on questions within the model's existing capabilities, while SFT is more effective for enabling progress on questions beyond the model's current scope.
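The difficulty split underlying this analysis can be sketched as a bucketing of questions by the policy's own rollout pass rate. This is an illustrative sketch only: the threshold values below are assumptions, since this report does not state the paper's exact cutoffs for the Easy/Medium/Hard/Hardest tiers.

```python
from typing import Dict, List

def bucket_by_pass_rate(pass_rates: Dict[str, float]) -> Dict[str, List[str]]:
    """Assign each question to a difficulty tier from its rollout pass rate.

    The cutoffs (0.75 / 0.25 / 0.0) are illustrative placeholders, not the
    paper's actual thresholds. The key special case is the "Hardest" tier:
    questions the model never solves in its own rollouts, which is where SFT
    on external demonstrations is claimed to help most.
    """
    buckets: Dict[str, List[str]] = {
        "Easy": [], "Medium": [], "Hard": [], "Hardest": []
    }
    for qid, rate in pass_rates.items():
        if rate >= 0.75:
            buckets["Easy"].append(qid)
        elif rate >= 0.25:
            buckets["Medium"].append(qid)
        elif rate > 0.0:
            buckets["Hard"].append(qid)
        else:  # never solved by the model itself
            buckets["Hardest"].append(qid)
    return buckets
```

Under this framing, the claimed finding is that RL mainly moves probability mass within the Easy/Medium/Hard tiers, while only SFT makes headway on the "Hardest" bucket.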

Contribution

ReLIFT training framework

The authors introduce Reinforcement Learning Interleaved with Online Fine-Tuning (ReLIFT), a training strategy that combines RL for general training with targeted supervised fine-tuning on challenging questions. The method dynamically collects high-quality solutions for hard problems during RL rollouts and performs fine-tuning steps when sufficient examples accumulate.
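The interleaving mechanism described above can be sketched as a single control loop. Everything in this sketch is an assumption for illustration: the function names, the rollout count, and the buffer threshold are placeholders, and the real method's RL objective and demonstration source are not specified here.

```python
from typing import Callable, List, Tuple

def relift_style_loop(
    questions: List[str],
    rollout: Callable[[str], Tuple[str, bool]],            # -> (solution, is_correct)
    rl_update: Callable[[List[Tuple[str, str, bool]]], None],
    sft_update: Callable[[List[Tuple[str, str]]], None],
    fetch_reference_solution: Callable[[str], str],        # online demonstration source
    sft_batch_size: int = 4,
    n_rollouts: int = 8,
) -> None:
    """Interleave RL steps with online SFT on the hardest questions.

    Questions the policy never solves across its own rollouts are treated as
    "hardest": a reference solution is collected for them online, and once
    enough accumulate, one SFT step is performed. All other questions feed
    the RL update as usual. Control flow is illustrative only.
    """
    sft_buffer: List[Tuple[str, str]] = []
    for q in questions:
        samples = [rollout(q) for _ in range(n_rollouts)]
        if not any(correct for _, correct in samples):
            # Hardest question: queue a high-quality demonstration for SFT.
            sft_buffer.append((q, fetch_reference_solution(q)))
        else:
            rl_update([(q, sol, correct) for sol, correct in samples])
        if len(sft_buffer) >= sft_batch_size:
            sft_update(sft_buffer)
            sft_buffer.clear()
```

The design point the sketch tries to capture is that SFT is triggered by observed failure during RL rollouts rather than by a fixed schedule, so demonstration data is spent only where the policy cannot make progress on its own.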

Contribution

State-of-the-art performance with reduced resource requirements

The authors demonstrate that ReLIFT achieves superior performance across mathematical reasoning and out-of-distribution benchmarks compared to pure RL, pure SFT, and various hybrid approaches. The method requires less demonstration data and training time while producing more concise solutions.