Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Post-Training, Large Reasoning Models, Large Language Models, Performance Prediction, Reinforcement Learning with Verifiable Rewards
Abstract:

In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as "RL" below). In this work, we challenge whether high SFT scores translate to improved performance after RL, and we provide extensive counter-examples where they do not. We find that high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance leads to substantially worse outcomes than RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large-k performance as strong proxies for the RL outcome. We trained hundreds of models of up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending over 1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, and Qwen3, and multiple state-of-the-art SFT/RL datasets. Compared to predicting directly from pre-RL performance, prediction based on generalization loss and Pass@large-k achieves substantially higher precision, improving the R^2 coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments we find that SFT on unique examples for one epoch underperforms training on half as many examples for two epochs, both after SFT and after SFT-then-RL. With the same SFT budget, training only on short examples may yield better SFT performance, but it often leads to worse outcomes after RL compared to training on examples of varying lengths. This work also develops an enhanced evaluation tool that will be open-sourced.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper investigates whether supervised fine-tuning (SFT) performance reliably predicts subsequent reinforcement learning (RL) outcomes in reasoning-focused large language models. It resides in the 'Standard Sequential Pipeline Analysis' leaf, which contains four papers examining the conventional SFT-then-RL training paradigm. This leaf sits within a broader taxonomy of fifty papers spanning integration frameworks, reasoning enhancement, multimodal methods, and specialized applications. The research direction is moderately populated, with sibling works exploring similar questions about the SFT-RL relationship, suggesting this is an active but not overcrowded area of inquiry.

The taxonomy reveals neighboring research in unified SFT-RL frameworks (seven papers across dynamic weighting and single-stage methods) and alternative RL paradigms (four papers). The paper's focus on predictive metrics distinguishes it from sibling works like 'Harmonizing SFT and RL' or 'Bridging SL and RL,' which emphasize theoretical connections, and from 'RL Outperforms SFT' or 'RL Panacea nor Mirage,' which question the strength of SFT-RL coupling without proposing alternative predictive metrics. The taxonomy's scope notes clarify that this leaf excludes unified methods and domain-specific implementations, positioning the work as a diagnostic study of the standard pipeline rather than a novel training paradigm.

Among twenty-eight candidates examined, the contribution identifying SFT failure modes shows one refutable candidate out of ten examined, suggesting some prior recognition of SFT-RL misalignment. The proposed alternative metrics (generalization loss and Pass@large k) show no refutable candidates across eight examined papers, indicating potential novelty in this specific predictive framework. The evaluation tool contribution similarly shows no refutations among ten candidates. The limited search scope means these statistics reflect top-K semantic matches and citations, not exhaustive coverage, so unexamined work may exist in adjacent research areas or specialized venues.

Given the moderate density of the research area and the limited search scope, the work appears to offer incremental but substantive contributions. The failure mode analysis builds on existing skepticism about SFT-RL relationships, while the alternative metrics and evaluation tools may represent more novel elements. The analysis covers a focused slice of the literature—primarily papers semantically close to the core SFT-RL pipeline question—leaving open the possibility of related insights in meta-learning, transfer studies, or domain-specific fine-tuning branches not fully captured by the search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 1

Research Landscape Overview

Core task: predicting reinforcement learning outcomes from supervised fine-tuning metrics. The field structure reflects a broad landscape where SFT-RL integration frameworks examine how supervised and reinforcement learning stages interact, reasoning and mathematical problem-solving enhancement targets domain-specific cognitive tasks, multimodal and vision-language model fine-tuning extends these paradigms to richer input modalities, specialized application domains address concrete use cases from robotics to clinical reasoning, and meta-learning studies explore transfer and generalization across tasks. Within the integration frameworks, a particularly active line investigates sequential two-stage SFT-then-RL approaches, analyzing whether early supervised metrics can forecast later RL performance. Representative works such as Harmonizing SFT and RL[5] and Bridging SL and RL[7] explore theoretical connections, while others like ReFT[3] and Visual-RFT[2] demonstrate practical pipelines across text and vision domains.

A central tension across these branches concerns whether SFT quality reliably predicts RL success or whether the two stages remain largely decoupled. Some studies, including RL Outperforms SFT[4] and RL Panacea nor Mirage[9], question the strength of this relationship, highlighting cases where strong supervised baselines yield disappointing RL gains or vice versa. Quagmires in SFT-RL[0] sits squarely within this debate, examining the standard sequential pipeline and probing which SFT metrics, if any, offer meaningful foresight into downstream RL outcomes. Its emphasis contrasts with works like Roads to Likelihood[16], which focuses on likelihood-based analysis, and RL Heals OOD[45], which investigates how RL can recover from out-of-distribution supervised data.

Together, these efforts reveal an open question: whether predictive relationships between SFT and RL are robust design principles or context-dependent phenomena requiring careful empirical validation.

Claimed Contributions

Identification of failure modes where high SFT scores mislead RL outcomes

The authors demonstrate through extensive experiments that models with better post-SFT performance do not always achieve better outcomes after reinforcement learning. They identify specific failure modes where SFT performance is biased toward simpler, repeated, or homogeneous data, leading to misleading predictions of RL success.

10 retrieved papers · Can Refute
Generalization loss and Pass@large k as reliable predictors for RL outcomes

The authors propose two new metrics—generalization loss on validation examples and Pass@k accuracy at large k values—as more reliable predictors of post-RL performance compared to standard SFT evaluation metrics. These metrics improve prediction accuracy and ranking correlation substantially.

8 retrieved papers
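For context on the Pass@large-k metric named above: the standard unbiased pass@k estimator from the code-generation evaluation literature computes, from n sampled generations of which c are correct, the probability that at least one of k draws (without replacement) is correct. The sketch below is illustrative only; the paper's exact estimator and sampling setup are not specified in this report.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k generations,
    drawn without replacement from n samples of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so a correct draw is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Small worked case: n=4 samples, c=2 correct, k=2 draws
# pass@2 = 1 - C(2,2)/C(4,2) = 1 - 1/6 = 5/6
print(pass_at_k(4, 2, 2))
```

At large k (the paper evaluates with up to 256 repetitions per problem), this metric rewards models whose sample distribution retains diverse correct solutions, which is plausibly why it tracks RL headroom better than greedy accuracy.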
Enhanced evaluation tool for reasoning model assessment

The authors developed a new evaluation tool designed to address limitations in existing tools for assessing reasoning models. This tool will be released as open-source to benefit the research community.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification of failure modes where high SFT scores mislead RL outcomes

The authors demonstrate through extensive experiments that models with better post-SFT performance do not always achieve better outcomes after reinforcement learning. They identify specific failure modes where SFT performance is biased toward simpler, repeated, or homogeneous data, leading to misleading predictions of RL success.

Contribution

Generalization loss and Pass@large k as reliable predictors for RL outcomes

The authors propose two new metrics—generalization loss on validation examples and Pass@k accuracy at large k values—as more reliable predictors of post-RL performance compared to standard SFT evaluation metrics. These metrics improve prediction accuracy and ranking correlation substantially.
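The ranking-correlation claim can be made concrete: Spearman's coefficient is the Pearson correlation of rank-transformed values, so a predictor that orders checkpoints correctly scores 1.0 even if its scale is arbitrary. Below is a dependency-free sketch of the computation (a toy illustration with tie-averaged ranks, not the paper's evaluation code):

```python
def rankdata(xs):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the tie group
        avg = (i + j) / 2 + 1  # average rank of the tied group
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# A monotone but nonlinear predictor still ranks perfectly:
print(spearman([1, 2, 3], [1, 4, 9]))  # 1.0
```

An improvement of up to 0.5 in this coefficient, as the paper reports, therefore corresponds to a substantially more faithful ordering of candidate SFT checkpoints by their eventual post-RL performance.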

Contribution

Enhanced evaluation tool for reasoning model assessment

The authors developed a new evaluation tool designed to address limitations in existing tools for assessing reasoning models. This tool will be released as open-source to benefit the research community.