Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
Overview
Overall Novelty Assessment
This paper investigates whether supervised fine-tuning (SFT) performance reliably predicts subsequent reinforcement learning (RL) outcomes in reasoning-focused large language models. It resides in the 'Standard Sequential Pipeline Analysis' leaf, which contains four papers examining the conventional SFT-then-RL training paradigm. This leaf sits within a broader taxonomy of fifty papers spanning integration frameworks, reasoning enhancement, multimodal methods, and specialized applications. The research direction is moderately populated, with sibling works exploring similar questions about the SFT-RL relationship, suggesting this is an active but not overcrowded area of inquiry.
The taxonomy reveals neighboring research in unified SFT-RL frameworks (seven papers across dynamic weighting and single-stage methods) and alternative RL paradigms (four papers). The paper's focus on predictive metrics distinguishes it from sibling works like 'Harmonizing SFT and RL' or 'Bridging SL and RL,' which emphasize theoretical connections, and from 'RL Outperforms SFT' or 'RL Panacea nor Mirage,' which question the strength of SFT-RL coupling without proposing alternative predictive metrics. The taxonomy's scope notes clarify that this leaf excludes unified methods and domain-specific implementations, positioning the work as a diagnostic study of the standard pipeline rather than a novel training paradigm.
Among the twenty-eight candidate papers examined in total, the contribution identifying SFT failure modes has one refutable candidate among its ten matches, suggesting some prior recognition of SFT-RL misalignment. The proposed alternative metrics (generalization loss and Pass@large k) show no refutable candidates across eight examined papers, indicating potential novelty in this specific predictive framework. The evaluation tool contribution similarly shows no refutations among ten candidates. The limited search scope means these statistics reflect top-K semantic matches and citations, not exhaustive coverage, so unexamined work may exist in adjacent research areas or specialized venues.
Given the moderate density of the research area and the limited search scope, the work appears to offer incremental but substantive contributions. The failure mode analysis builds on existing skepticism about SFT-RL relationships, while the alternative metrics and evaluation tools may represent more novel elements. The analysis covers a focused slice of the literature—primarily papers semantically close to the core SFT-RL pipeline question—leaving open the possibility of related insights in meta-learning, transfer studies, or domain-specific fine-tuning branches not fully captured by the search.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors demonstrate through extensive experiments that models with better post-SFT performance do not always achieve better outcomes after reinforcement learning. They identify specific failure modes where SFT performance is biased toward simpler, repeated, or homogeneous data, leading to misleading predictions of RL success.
The authors propose two new metrics—generalization loss on validation examples and Pass@k accuracy at large k values—as more reliable predictors of post-RL performance compared to standard SFT evaluation metrics. These metrics improve prediction accuracy and ranking correlation substantially.
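The report does not spell out how Pass@k at large k is computed; a standard construction is the unbiased estimator popularized by code-generation evaluations, sketched below under the assumption that `n` completions are sampled per problem and `c` of them are correct (the function name `pass_at_k` is illustrative, not taken from the paper):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: the probability that at least one of
    k samples, drawn without replacement from n generations of which
    c are correct, solves the problem."""
    if n - c < k:
        # Fewer than k incorrect generations exist, so every
        # k-subset must contain at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

At large k (with n at least k samples per problem), this probes whether a correct solution exists anywhere in the model's sampling distribution, rather than whether a single greedy decode happens to find it, which is plausibly why it tracks post-RL headroom better than standard SFT accuracy.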
The authors develop a new evaluation tool that addresses limitations in existing tools for assessing reasoning models. The tool will be released as open source to benefit the research community.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[9] RL Is Neither a Panacea nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
[16] All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
[45] RL Fine-Tuning Heals OOD Forgetting in SFT
Contribution Analysis
Detailed comparisons for each claimed contribution
Identification of failure modes where high SFT scores mislead RL outcomes
The authors demonstrate through extensive experiments that models with better post-SFT performance do not always achieve better outcomes after reinforcement learning. They identify specific failure modes where SFT performance is biased toward simpler, repeated, or homogeneous data, leading to misleading predictions of RL success.
[51] Training Language Models to Self-Correct via Reinforcement Learning
[8] Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
[15] Supervised Fine-Tuning on Curated Data Is Reinforcement Learning (and Can Be Improved)
[52] Preserving Diversity in Supervised Fine-Tuning of Large Language Models
[53] Supervised Fine-Tuning as Inverse Reinforcement Learning
[54] Code Security Vulnerability Repair Using Reinforcement Learning with Large Language Models
[55] Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
[56] RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
[57] Teaching Large Language Models to Reason with Reinforcement Learning
[58] Fine-Tuning Language Models for Factuality
Generalization loss and Pass@large k as reliable predictors for RL outcomes
The authors propose two new metrics—generalization loss on validation examples and Pass@k accuracy at large k values—as more reliable predictors of post-RL performance compared to standard SFT evaluation metrics. These metrics improve prediction accuracy and ranking correlation substantially.
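The ranking-correlation claim can be made concrete: one checks whether ordering SFT checkpoints by a candidate predictor reproduces their ordering by eventual post-RL performance, typically via Spearman's rho. A minimal dependency-free sketch, assuming no tied scores (variable names are illustrative, not from the paper):

```python
def spearman_rho(pred_scores, post_rl_scores):
    """Spearman rank correlation between per-checkpoint predictor
    values and their eventual post-RL results (no ties assumed)."""
    def ranks(vals):
        # Rank of each element by sorted position.
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    rx, ry = ranks(pred_scores), ranks(post_rl_scores)
    n = len(pred_scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A predictor that preserves the post-RL ordering of checkpoints scores near +1; the paper's claim amounts to generalization loss and Pass@large k scoring markedly higher on this measure than standard SFT accuracy does.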
[69] Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
[70] B-Coder: Value-Based Deep Reinforcement Learning for Program Synthesis
[71] RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization
[72] Learning Shrinks the Hard Tail: Training-Dependent Inference Scaling in a Solvable Linear Model
[73] Best-of-Majority: Minimax-Optimal Strategy for Pass@k Inference Scaling
[74] RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?
[75] DELTA: How Does RL Unlock and Transfer New Algorithms in LLMs?
[76] Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models
Enhanced evaluation tool for reasoning model assessment
The authors develop a new evaluation tool that addresses limitations in existing tools for assessing reasoning models. The tool will be released as open source to benefit the research community.