Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models; Reasoning; Reinforcement Learning; Supervised Fine-Tuning
Abstract:

Recent advances in large language model (LLM) reasoning have shown that reasoning ability can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce \textbf{ReLIFT} (\textbf{Re}inforcement \textbf{L}earning \textbf{I}nterleaved with Online \textbf{F}ine-\textbf{T}uning), a novel training strategy. ReLIFT employs RL for general training, but interleaves it with targeted SFT on challenging questions for which high-quality solutions are collected online. By alternating between RL and SFT, ReLIFT addresses model weaknesses as they emerge. Empirically, ReLIFT outperforms previous RLVR methods by an average of +6.7 points across a suite of six benchmarks (five math reasoning and one out-of-distribution). More importantly, ReLIFT surpasses baselines such as individual RL, individual SFT, and various hybrid approaches while reducing the required training time. These results provide compelling evidence that ReLIFT is a powerful and resource-efficient paradigm for developing capable reasoning models. The code is available at \href{https://anonymous.4open.science/r/Learning-What-Reinforcement-Learning-Can-t-6AFF/}{here}.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ReLIFT, a training strategy that interleaves reinforcement learning with targeted supervised fine-tuning on challenging questions identified during training. Within the taxonomy, it occupies the 'Interleaved Online Fine-Tuning' leaf under 'Hybrid Training Paradigms and Integration Strategies'. Notably, this leaf contains only the original paper itself—no sibling papers are present. This isolation suggests the specific formulation of online interleaving with difficulty-based targeting represents a relatively unexplored niche within the broader hybrid training landscape.

The taxonomy reveals a crowded parent branch ('Hybrid Training Paradigms') with neighboring leaves addressing adaptive alternation, sequential pipelines, and cooperative optimization. These directions collectively contain nine papers exploring various RL-SFT integration strategies. The paper's approach diverges from fixed-order sequential methods and static adaptive weighting by emphasizing online data collection during RL training. The taxonomy's scope and exclude notes clarify that offline SFT-then-RL pipelines belong elsewhere, positioning this work at the boundary between dynamic adaptation and online learning paradigms.

Among the thirty candidates examined, the analysis identifies limited overlap with prior work. The systematic training-dynamics analysis (Contribution A) was compared against ten candidates, one of which appears to refute it; the ReLIFT framework (Contribution B) was compared against ten candidates, three of which are potential refutations. The performance claims (Contribution C) found no refuting candidates among the ten examined. These statistics suggest that while the core interleaving mechanism has some precedent within the limited search scope, the specific combination of difficulty-based targeting and online SFT integration appears less saturated. The modest refutation counts reflect the constrained search scale rather than exhaustive coverage.

Given the top-thirty semantic search scope, the work appears to occupy a sparsely populated research direction within a well-studied parent domain. The single-paper leaf and limited refutations suggest novelty in the specific formulation, though the broader hybrid training paradigm is actively explored. The analysis does not cover potential work outside the semantic neighborhood or recent preprints, leaving open questions about concurrent developments in online interleaving strategies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: combining reinforcement learning with supervised fine-tuning for language model reasoning. The field has evolved into a rich landscape organized around nine major branches. Hybrid Training Paradigms and Integration Strategies explore how to blend RL and SFT signals, ranging from interleaved schedules to cooperative frameworks that balance exploration and imitation. Reinforcement Learning Mechanisms and Reward Design focuses on outcome and process rewards, while Supervised Learning Approaches and Enhancements refines distillation and synthetic data generation. Self-Improvement and Correction Mechanisms enables models to iteratively refine their outputs, and Inference-Time Reasoning and Test-Time Scaling investigates how models can deliberate more effectively at deployment. Domain-Specific Applications and Adaptations tailors these methods to mathematics, coding, and specialized fields, while Multi-Agent and Collaborative Frameworks coordinates multiple reasoning agents. Analysis, Evaluation, and Theoretical Foundations provides empirical insights and formal understanding, and Auxiliary Methods and Supporting Techniques supplies tools such as retrieval augmentation and curriculum learning.

Recent work reveals contrasting philosophies in how RL and SFT should interact. Some studies argue for tight integration in which RL squeezes the policy while SFT expands it, as explored in RL vs SL Refactoring[4] and RL Squeezes SFT Expands[35], while others advocate cooperative or step-wise adaptive blending, exemplified by Cooperative SFT RL[43] and Step-wise Adaptive Integration[17]. Interleaved Fine-Tuning[0] sits within the Hybrid Training Paradigms branch, emphasizing online interleaving of RL and SFT updates to maintain stability and leverage fresh exploration. This approach contrasts with works such as SuperRL[1] and ReFT[2], which may prioritize different sequencing or reward structures, and complements methods such as ARES[3] that focus on self-correction during training.
The central tension across these lines involves balancing sample efficiency, stability, and the ability to generalize reasoning patterns beyond narrow task distributions.

Claimed Contributions

Systematic analysis of RL and SFT training dynamics across question difficulty levels

The authors analyze how reinforcement learning and supervised fine-tuning affect model accuracy on questions of varying difficulty (Easy, Medium, Hard, Hardest). They find that RL excels at improving performance on questions within the model's existing capabilities, while SFT is more effective for enabling progress on questions beyond the model's current scope.

10 retrieved papers
Can Refute
ReLIFT training framework

The authors introduce Reinforcement Learning Interleaved with Online Fine-Tuning (ReLIFT), a training strategy that combines RL for general training with targeted supervised fine-tuning on challenging questions. The method dynamically collects high-quality solutions for hard problems during RL rollouts and performs fine-tuning steps when sufficient examples accumulate.

10 retrieved papers
Can Refute
State-of-the-art performance with reduced resource requirements

The authors demonstrate that ReLIFT achieves superior performance across mathematical reasoning and out-of-distribution benchmarks compared to pure RL, pure SFT, and various hybrid approaches. The method requires less demonstration data and training time while producing more concise solutions.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic analysis of RL and SFT training dynamics across question difficulty levels

The authors analyze how reinforcement learning and supervised fine-tuning affect model accuracy on questions of varying difficulty (Easy, Medium, Hard, Hardest). They find that RL excels at improving performance on questions within the model's existing capabilities, while SFT is more effective for enabling progress on questions beyond the model's current scope.
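The difficulty split underlying this analysis can be sketched as a bucketing of questions by the policy's own rollout pass rate. This is an illustrative sketch only: the threshold values below are assumptions, since this report does not state the paper's exact cutoffs for the Easy/Medium/Hard/Hardest tiers.

```python
from typing import Dict, List

def bucket_by_pass_rate(pass_rates: Dict[str, float]) -> Dict[str, List[str]]:
    """Assign each question to a difficulty tier from its rollout pass rate.

    The cutoffs (0.75 / 0.25 / 0.0) are illustrative placeholders, not the
    paper's actual thresholds. The key special case is the "Hardest" tier:
    questions the model never solves in its own rollouts, which is where SFT
    on external demonstrations is claimed to help most.
    """
    buckets: Dict[str, List[str]] = {
        "Easy": [], "Medium": [], "Hard": [], "Hardest": []
    }
    for qid, rate in pass_rates.items():
        if rate >= 0.75:
            buckets["Easy"].append(qid)
        elif rate >= 0.25:
            buckets["Medium"].append(qid)
        elif rate > 0.0:
            buckets["Hard"].append(qid)
        else:  # never solved by the model itself
            buckets["Hardest"].append(qid)
    return buckets
```

Under this framing, the claimed finding is that RL mainly moves probability mass within the Easy/Medium/Hard tiers, while only SFT makes headway on the "Hardest" bucket.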

Contribution

ReLIFT training framework

The authors introduce Reinforcement Learning Interleaved with Online Fine-Tuning (ReLIFT), a training strategy that combines RL for general training with targeted supervised fine-tuning on challenging questions. The method dynamically collects high-quality solutions for hard problems during RL rollouts and performs fine-tuning steps when sufficient examples accumulate.
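The interleaving mechanism described above can be sketched as a single control loop. Everything in this sketch is an assumption for illustration: the function names, the rollout count, and the buffer threshold are placeholders, and the real method's RL objective and demonstration source are not specified here.

```python
from typing import Callable, List, Tuple

def relift_style_loop(
    questions: List[str],
    rollout: Callable[[str], Tuple[str, bool]],            # -> (solution, is_correct)
    rl_update: Callable[[List[Tuple[str, str, bool]]], None],
    sft_update: Callable[[List[Tuple[str, str]]], None],
    fetch_reference_solution: Callable[[str], str],        # online demonstration source
    sft_batch_size: int = 4,
    n_rollouts: int = 8,
) -> None:
    """Interleave RL steps with online SFT on the hardest questions.

    Questions the policy never solves across its own rollouts are treated as
    "hardest": a reference solution is collected for them online, and once
    enough accumulate, one SFT step is performed. All other questions feed
    the RL update as usual. Control flow is illustrative only.
    """
    sft_buffer: List[Tuple[str, str]] = []
    for q in questions:
        samples = [rollout(q) for _ in range(n_rollouts)]
        if not any(correct for _, correct in samples):
            # Hardest question: queue a high-quality demonstration for SFT.
            sft_buffer.append((q, fetch_reference_solution(q)))
        else:
            rl_update([(q, sol, correct) for sol, correct in samples])
        if len(sft_buffer) >= sft_batch_size:
            sft_update(sft_buffer)
            sft_buffer.clear()
```

The design point the sketch tries to capture is that SFT is triggered by observed failure during RL rollouts rather than by a fixed schedule, so demonstration data is spent only where the policy cannot make progress on its own.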

Contribution

State-of-the-art performance with reduced resource requirements

The authors demonstrate that ReLIFT achieves superior performance across mathematical reasoning and out-of-distribution benchmarks compared to pure RL, pure SFT, and various hybrid approaches. The method requires less demonstration data and training time while producing more concise solutions.