AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM, Reasoning, Reinforcement Learning, Supervised Fine-tuning, Math, Code
Abstract:

In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization?
Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Built on a strong SFT foundation and SFT–RL synergy, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper investigates how supervised fine-tuning (SFT) and reinforcement learning (RL) interact to produce strong reasoning models, focusing on data scaling strategies and temperature selection during RL training. It resides in the Mathematical and Code Reasoning leaf, which contains six papers addressing SFT-RL synergy in structured reasoning domains. This leaf sits within the broader Application Domains branch, indicating a moderately populated research direction where domain-specific methods are actively explored. The paper's emphasis on systematic scaling and temperature tuning positions it alongside works examining training orchestration and data quality in mathematical reasoning tasks.

The taxonomy reveals that Mathematical and Code Reasoning is one of several application-focused branches, with neighboring leaves covering Vision-Language Reasoning (16 papers across four sub-leaves) and Specialized Domain Applications (3 papers). The Integration Frameworks branch (12 papers across three leaves) explores general training paradigms, while Theoretical Foundations (9 papers across three leaves) examines mechanistic analyses and comparative studies. The paper's focus on practical training guidelines connects it to Sequential Training Strategies and Mechanistic Analysis leaves, though it remains grounded in mathematical reasoning applications rather than proposing domain-agnostic frameworks.

Among 29 candidates examined, no contributions were clearly refuted. The first contribution (SFT data scaling strategies) examined 10 candidates with zero refutable matches, suggesting limited prior work on systematic prompt versus response scaling comparisons. The second contribution (SFT-RL synergy and temperature selection) examined 9 candidates with no refutations, indicating that the specific temperature-entropy guideline (maintaining 0.3 entropy) may represent a novel empirical finding within the limited search scope. The third contribution (AceReason-Nemotron model) examined 10 candidates with no overlaps, though model releases are inherently unique artifacts. These statistics reflect a top-K semantic search, not exhaustive coverage.
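As context for how such candidate counts arise, a generic sketch of top-K semantic search over paper embeddings is shown below. This is an illustrative assumption about the retrieval step (cosine similarity over precomputed embeddings), not a description of WisPaper's actual implementation; all names here are hypothetical.

```python
import numpy as np

def top_k_candidates(query_emb: np.ndarray, corpus_embs: np.ndarray, k: int = 10):
    """Return indices and cosine similarities of the K nearest papers.

    query_emb:   (d,) embedding of the claimed contribution.
    corpus_embs: (n, d) embeddings of candidate papers.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity per paper
    idx = np.argsort(-sims)[:k]      # top-K by descending similarity
    return idx, sims[idx]
```

Because only the K nearest candidates are compared, an absence of refutations bounds what the search saw, not what exists, which is why the report stresses non-exhaustive coverage.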

Based on the limited literature search, the work appears to offer incremental advances in understanding SFT-RL interactions for mathematical reasoning. The temperature selection guideline and scaling analysis provide practical insights, though the absence of refutations may partly reflect the search scope rather than absolute novelty. The taxonomy context shows this is an active but not overcrowded research direction, with room for empirical studies that bridge training paradigms and domain-specific applications.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 29
- Refutable Papers: 0

Research Landscape Overview

Core task: synergy between supervised fine-tuning and reinforcement learning for reasoning models. The field explores how supervised fine-tuning (SFT) and reinforcement learning (RL) can be combined to enhance reasoning capabilities in language and multimodal models. The taxonomy reveals several major branches:

- Integration Frameworks and Training Paradigms examine how to orchestrate SFT and RL stages, including sequential pipelines and interleaved approaches like Interleaved Online Fine-Tuning[15].
- Theoretical Foundations and Comparative Analysis investigate when and why each method excels, as seen in works like On-policy Off-policy Harmony[2] and Synergy Dilemma CoT[33].
- Application Domains and Task-Specific Methods focus on deploying these techniques in mathematical reasoning (Bridging SFT RL Math[10]), code generation (SFT RL Correlation Code[17]), and vision-language tasks (Visual RFT[1], Reason RFT VLM[8]).
- Reinforcement Learning Techniques and Optimization develop novel RL algorithms and reward mechanisms such as SRFT[7] and Token-Efficient RL[39].
- Fine-Tuning Impact and Side Effects study how SFT influences downstream RL performance, exemplified by RL Squeezes SFT Expands[46] and Impact Fine-Tuning CoT[45].

A particularly active line of work centers on mathematical and code reasoning, where researchers debate optimal training sequences and the relative contributions of SFT versus RL. Teaching LLMs Reason[5] and Bridging SFT RL Math[10] explore foundational strategies for combining both paradigms in structured reasoning tasks, while works like REFT[13] and Reason RFT[4] propose refined rejection sampling and filtering techniques to improve data quality before RL training. AceReason Nemotron[0] situates itself within this mathematical and code reasoning cluster, emphasizing adaptive integration strategies that balance SFT's ability to provide strong initial reasoning patterns with RL's capacity for exploration and reward-driven refinement.
Compared to Step-wise Adaptive Integration[3], which dynamically adjusts training phases, and SFT RL Correlation Code[17], which analyzes correlations in code tasks, AceReason Nemotron[0] focuses on achieving synergy through careful orchestration of both methods to maximize reasoning performance across diverse problem types.

Claimed Contributions

Systematic investigation of SFT data scaling strategies

The authors systematically explore two axes for scaling supervised fine-tuning data: increasing the number of unique prompts and increasing the number of responses per prompt. They find that scaling prompts yields more substantial gains than scaling responses per prompt, and observe consistent performance improvements across training epochs.

10 retrieved papers
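The two scaling axes described above can be made concrete with a small sketch. This is an illustrative data-construction routine under assumed names (`build_sft_data`, `sample_response` standing in for a teacher-model sampler), not the authors' actual pipeline:

```python
import random

def sample_response(prompt: str, sample_idx: int) -> str:
    """Stand-in for drawing one response from a teacher model."""
    return f"response-{sample_idx} for {prompt!r}"

def build_sft_data(prompt_pool, n_prompts: int, k_responses: int, seed: int = 0):
    """Scale SFT data along two independent axes:
    - n_prompts:   number of unique prompts collected (coverage of problems)
    - k_responses: number of sampled responses per prompt (solution diversity)
    """
    rng = random.Random(seed)
    chosen = rng.sample(prompt_pool, n_prompts)
    return [(p, sample_response(p, i)) for p in chosen for i in range(k_responses)]
```

Under this framing, the paper's finding is that growing `n_prompts` (more distinct problems) helps more than growing `k_responses` (more solutions per problem) at a matched dataset size.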
Analysis of SFT-RL synergy and temperature selection guideline

The authors investigate how different SFT initializations affect final RL performance and establish that stronger SFT models lead to better outcomes when RL is conducted effectively. They provide a rule of thumb for setting sampling temperature to maintain temperature-adjusted entropy around 0.3 for effective exploration-exploitation balance.

9 retrieved papers
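A minimal sketch of how such a guideline could be applied in practice, assuming "temperature-adjusted entropy" means the entropy of the temperature-scaled softmax over logits (the paper's exact definition may differ, e.g. averaged over policy rollouts): since entropy increases monotonically with temperature for fixed logits, one can bisect for the temperature that lands near the 0.3 target.

```python
import numpy as np

def temperature_adjusted_entropy(logits: np.ndarray, temperature: float) -> float:
    """Entropy (in nats) of softmax(logits / temperature)."""
    z = logits / temperature
    z = z - z.max()                              # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def find_temperature(logits: np.ndarray, target_entropy: float = 0.3,
                     lo: float = 0.1, hi: float = 2.0, iters: int = 40) -> float:
    """Bisect for the temperature whose adjusted entropy hits the target,
    relying on entropy being monotonically increasing in temperature."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if temperature_adjusted_entropy(logits, mid) < target_entropy:
            lo = mid                             # too cold: raise temperature
        else:
            hi = mid                             # too hot: lower temperature
    return 0.5 * (lo + hi)
```

In an RL loop, the same idea would be applied per SFT initialization: measure entropy at the current sampling temperature and adjust toward the 0.3 target to balance exploration against exploitation.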
AceReason-Nemotron-1.1 7B model achieving state-of-the-art performance

The authors develop AceReason-Nemotron-1.1, a 7B parameter model that combines their strong SFT foundation with stage-wise RL training. The model achieves new state-of-the-art results among Qwen2.5-7B-based models on math and code benchmarks, validating their integrated post-training approach.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
