Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Post-Training, Large Reasoning Models, Large Language Models, Performance Prediction, Reinforcement Learning with Verifiable Rewards
Abstract:

In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as "RL" below). In this work, we challenge whether high SFT scores translate to improved performance after RL, and we provide extensive counter-examples where they do not. We find that high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance leads to substantially worse outcomes than RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large-k performance as strong proxies for the RL outcome. We trained hundreds of models of up to 12B parameters with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending over 1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, and Qwen3, and multiple state-of-the-art SFT/RL datasets. Compared to predicting directly from pre-RL performance, prediction based on generalization loss and Pass@large-k achieves substantially higher precision, improving the R^2 coefficient and Spearman's rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments we find that SFT on unique examples for one epoch underperforms training on half as many examples for two epochs, both after SFT and after SFT-then-RL. With the same SFT budget, training only on short examples may yield better SFT performance, but it often leads to worse outcomes after RL compared to training on examples of varying lengths. This work also develops an enhanced evaluation tool that will be open-sourced.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper investigates whether supervised fine-tuning (SFT) performance reliably predicts subsequent reinforcement learning (RL) outcomes in reasoning-focused large language models. It resides in the 'Standard Sequential Pipeline Analysis' leaf, which contains four papers examining the conventional SFT-then-RL training paradigm. This leaf sits within a broader taxonomy of fifty papers spanning integration frameworks, reasoning enhancement, multimodal methods, and specialized applications. The research direction is moderately populated, with sibling works exploring similar questions about the SFT-RL relationship, suggesting this is an active but not overcrowded area of inquiry.

The taxonomy reveals neighboring research in unified SFT-RL frameworks (seven papers across dynamic weighting and single-stage methods) and alternative RL paradigms (four papers). The paper's focus on predictive metrics distinguishes it from sibling works like 'Harmonizing SFT and RL' or 'Bridging SL and RL,' which emphasize theoretical connections, and from 'RL Outperforms SFT' or 'RL Panacea nor Mirage,' which question the strength of SFT-RL coupling without proposing alternative predictive metrics. The taxonomy's scope notes clarify that this leaf excludes unified methods and domain-specific implementations, positioning the work as a diagnostic study of the standard pipeline rather than a novel training paradigm.

Among twenty-eight candidates examined, the contribution identifying SFT failure modes shows one refutable candidate out of ten examined, suggesting some prior recognition of SFT-RL misalignment. The proposed alternative metrics (generalization loss and Pass@large k) show no refutable candidates across eight examined papers, indicating potential novelty in this specific predictive framework. The evaluation tool contribution similarly shows no refutations among ten candidates. The limited search scope means these statistics reflect top-K semantic matches and citations, not exhaustive coverage, so unexamined work may exist in adjacent research areas or specialized venues.

Given the moderate density of the research area and the limited search scope, the work appears to offer incremental but substantive contributions. The failure mode analysis builds on existing skepticism about SFT-RL relationships, while the alternative metrics and evaluation tools may represent more novel elements. The analysis covers a focused slice of the literature—primarily papers semantically close to the core SFT-RL pipeline question—leaving open the possibility of related insights in meta-learning, transfer studies, or domain-specific fine-tuning branches not fully captured by the search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 1

Research Landscape Overview

Core task: predicting reinforcement learning outcomes from supervised fine-tuning metrics. The field structure reflects a broad landscape where SFT-RL integration frameworks examine how supervised and reinforcement learning stages interact, reasoning and mathematical problem-solving enhancement targets domain-specific cognitive tasks, multimodal and vision-language model fine-tuning extends these paradigms to richer input modalities, specialized application domains address concrete use cases from robotics to clinical reasoning, and meta-learning studies explore transfer and generalization across tasks. Within the integration frameworks, a particularly active line investigates sequential two-stage SFT-then-RL approaches, analyzing whether early supervised metrics can forecast later RL performance. Representative works such as Harmonizing SFT and RL[5] and Bridging SL and RL[7] explore theoretical connections, while others like ReFT[3] and Visual-RFT[2] demonstrate practical pipelines across text and vision domains.

A central tension across these branches concerns whether SFT quality reliably predicts RL success or whether the two stages remain largely decoupled. Some studies, including RL Outperforms SFT[4] and RL Panacea nor Mirage[9], question the strength of this relationship, highlighting cases where strong supervised baselines yield disappointing RL gains or vice versa. Quagmires in SFT-RL[0] sits squarely within this debate, examining the standard sequential pipeline and probing which SFT metrics, if any, offer meaningful foresight into downstream RL outcomes. Its emphasis contrasts with works like Roads to Likelihood[16], which focuses on likelihood-based analysis, and RL Heals OOD[45], which investigates how RL can recover from out-of-distribution supervised data.

Together, these efforts reveal an open question: whether predictive relationships between SFT and RL are robust design principles or context-dependent phenomena requiring careful empirical validation.

Claimed Contributions

Identification of failure modes where high SFT scores mislead RL outcomes

The authors demonstrate through extensive experiments that models with better post-SFT performance do not always achieve better outcomes after reinforcement learning. They identify specific failure modes where SFT performance is biased toward simpler, repeated, or homogeneous data, leading to misleading predictions of RL success.

10 retrieved papers · Can Refute
Generalization loss and Pass@large k as reliable predictors for RL outcomes

The authors propose two new metrics—generalization loss on validation examples and Pass@k accuracy at large k values—as more reliable predictors of post-RL performance compared to standard SFT evaluation metrics. These metrics improve prediction accuracy and ranking correlation substantially.

8 retrieved papers
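For context on the Pass@large-k metric named above: the standard unbiased pass@k estimator from the code-generation evaluation literature computes, from n sampled generations of which c are correct, the probability that at least one of k draws (without replacement) is correct. The sketch below is illustrative only; the paper's exact estimator and sampling setup are not specified in this report.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k generations,
    drawn without replacement from n samples of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so a correct draw is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Small worked case: n=4 samples, c=2 correct, k=2 draws
# pass@2 = 1 - C(2,2)/C(4,2) = 1 - 1/6 = 5/6
print(pass_at_k(4, 2, 2))
```

At large k (the paper evaluates with up to 256 repetitions per problem), this metric rewards models whose sample distribution retains diverse correct solutions, which is plausibly why it tracks RL headroom better than greedy accuracy.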
Enhanced evaluation tool for reasoning model assessment

The authors developed a new evaluation tool designed to address limitations in existing tools for assessing reasoning models. This tool will be released as open-source to benefit the research community.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification of failure modes where high SFT scores mislead RL outcomes

The authors demonstrate through extensive experiments that models with better post-SFT performance do not always achieve better outcomes after reinforcement learning. They identify specific failure modes where SFT performance is biased toward simpler, repeated, or homogeneous data, leading to misleading predictions of RL success.

Contribution

Generalization loss and Pass@large k as reliable predictors for RL outcomes

The authors propose two new metrics—generalization loss on validation examples and Pass@k accuracy at large k values—as more reliable predictors of post-RL performance compared to standard SFT evaluation metrics. These metrics improve prediction accuracy and ranking correlation substantially.
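The ranking-correlation claim can be made concrete: Spearman's coefficient is the Pearson correlation of rank-transformed values, so a predictor that orders checkpoints correctly scores 1.0 even if its scale is arbitrary. Below is a dependency-free sketch of the computation (a toy illustration with tie-averaged ranks, not the paper's evaluation code):

```python
def rankdata(xs):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the tie group
        avg = (i + j) / 2 + 1  # average rank of the tied group
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# A monotone but nonlinear predictor still ranks perfectly:
print(spearman([1, 2, 3], [1, 4, 9]))  # 1.0
```

An improvement of up to 0.5 in this coefficient, as the paper reports, therefore corresponds to a substantially more faithful ordering of candidate SFT checkpoints by their eventual post-RL performance.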

Contribution

Enhanced evaluation tool for reasoning model assessment

The authors developed a new evaluation tool designed to address limitations in existing tools for assessing reasoning models. This tool will be released as open-source to benefit the research community.