AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM, Reasoning, Reinforcement Learning, Supervised Fine-tuning, Math, Code
Abstract:

In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization?
Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Built on a strong SFT foundation and SFT–RL synergy, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper investigates how supervised fine-tuning (SFT) and reinforcement learning (RL) interact to produce strong reasoning models, focusing on data scaling strategies and temperature selection during RL training. It resides in the Mathematical and Code Reasoning leaf, which contains six papers addressing SFT-RL synergy in structured reasoning domains. This leaf sits within the broader Application Domains branch, indicating a moderately populated research direction where domain-specific methods are actively explored. The paper's emphasis on systematic scaling and temperature tuning positions it alongside works examining training orchestration and data quality in mathematical reasoning tasks.

The taxonomy reveals that Mathematical and Code Reasoning is one of several application-focused branches, with neighboring leaves covering Vision-Language Reasoning (16 papers across four sub-leaves) and Specialized Domain Applications (3 papers). The Integration Frameworks branch (12 papers across three leaves) explores general training paradigms, while Theoretical Foundations (9 papers across three leaves) examines mechanistic analyses and comparative studies. The paper's focus on practical training guidelines connects it to Sequential Training Strategies and Mechanistic Analysis leaves, though it remains grounded in mathematical reasoning applications rather than proposing domain-agnostic frameworks.

Among 29 candidates examined, no contributions were clearly refuted. The first contribution (SFT data scaling strategies) examined 10 candidates with zero refutable matches, suggesting limited prior work on systematic prompt versus response scaling comparisons. The second contribution (SFT-RL synergy and temperature selection) examined 9 candidates with no refutations, indicating that the specific temperature-entropy guideline (maintaining 0.3 entropy) may represent a novel empirical finding within the limited search scope. The third contribution (AceReason-Nemotron model) examined 10 candidates with no overlaps, though model releases are inherently unique artifacts. These statistics reflect a top-K semantic search, not exhaustive coverage.
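As context for how such candidate counts arise, a generic sketch of top-K semantic search over paper embeddings is shown below. This is an illustrative assumption about the retrieval step (cosine similarity over precomputed embeddings), not a description of WisPaper's actual implementation; all names here are hypothetical.

```python
import numpy as np

def top_k_candidates(query_emb: np.ndarray, corpus_embs: np.ndarray, k: int = 10):
    """Return indices and cosine similarities of the K nearest papers.

    query_emb:   (d,) embedding of the claimed contribution.
    corpus_embs: (n, d) embeddings of candidate papers.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity per paper
    idx = np.argsort(-sims)[:k]      # top-K by descending similarity
    return idx, sims[idx]
```

Because only the K nearest candidates are compared, an absence of refutations bounds what the search saw, not what exists, which is why the report stresses non-exhaustive coverage.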

Based on the limited literature search, the work appears to offer incremental advances in understanding SFT-RL interactions for mathematical reasoning. The temperature selection guideline and scaling analysis provide practical insights, though the absence of refutations may partly reflect the search scope rather than absolute novelty. The taxonomy context shows this is an active but not overcrowded research direction, with room for empirical studies that bridge training paradigms and domain-specific applications.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 29
- Refutable Papers: 0

Research Landscape Overview

Core task: synergy between supervised fine-tuning and reinforcement learning for reasoning models. The field explores how supervised fine-tuning (SFT) and reinforcement learning (RL) can be combined to enhance reasoning capabilities in language and multimodal models. The taxonomy reveals several major branches:

- Integration Frameworks and Training Paradigms examine how to orchestrate SFT and RL stages, including sequential pipelines and interleaved approaches like Interleaved Online Fine-Tuning[15].
- Theoretical Foundations and Comparative Analysis investigate when and why each method excels, as seen in works like On-policy Off-policy Harmony[2] and Synergy Dilemma CoT[33].
- Application Domains and Task-Specific Methods focus on deploying these techniques in mathematical reasoning (Bridging SFT RL Math[10]), code generation (SFT RL Correlation Code[17]), and vision-language tasks (Visual RFT[1], Reason RFT VLM[8]).
- Reinforcement Learning Techniques and Optimization develop novel RL algorithms and reward mechanisms such as SRFT[7] and Token-Efficient RL[39].
- Fine-Tuning Impact and Side Effects study how SFT influences downstream RL performance, exemplified by RL Squeezes SFT Expands[46] and Impact Fine-Tuning CoT[45].

A particularly active line of work centers on mathematical and code reasoning, where researchers debate optimal training sequences and the relative contributions of SFT versus RL. Teaching LLMs Reason[5] and Bridging SFT RL Math[10] explore foundational strategies for combining both paradigms in structured reasoning tasks, while works like REFT[13] and Reason RFT[4] propose refined rejection sampling and filtering techniques to improve data quality before RL training. AceReason Nemotron[0] situates itself within this mathematical and code reasoning cluster, emphasizing adaptive integration strategies that balance SFT's ability to provide strong initial reasoning patterns with RL's capacity for exploration and reward-driven refinement.
Compared to Step-wise Adaptive Integration[3], which dynamically adjusts training phases, and SFT RL Correlation Code[17], which analyzes correlations in code tasks, AceReason Nemotron[0] focuses on achieving synergy through careful orchestration of both methods to maximize reasoning performance across diverse problem types.

Claimed Contributions

Systematic investigation of SFT data scaling strategies

The authors systematically explore two axes for scaling supervised fine-tuning data: increasing the number of unique prompts and increasing the number of responses per prompt. They find that scaling prompts yields more substantial gains than scaling responses per prompt, and observe consistent performance improvements across training epochs.

10 retrieved papers
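The two scaling axes described above can be made concrete with a small sketch. This is an illustrative data-construction routine under assumed names (`build_sft_data`, `sample_response` standing in for a teacher-model sampler), not the authors' actual pipeline:

```python
import random

def sample_response(prompt: str, sample_idx: int) -> str:
    """Stand-in for drawing one response from a teacher model."""
    return f"response-{sample_idx} for {prompt!r}"

def build_sft_data(prompt_pool, n_prompts: int, k_responses: int, seed: int = 0):
    """Scale SFT data along two independent axes:
    - n_prompts:   number of unique prompts collected (coverage of problems)
    - k_responses: number of sampled responses per prompt (solution diversity)
    """
    rng = random.Random(seed)
    chosen = rng.sample(prompt_pool, n_prompts)
    return [(p, sample_response(p, i)) for p in chosen for i in range(k_responses)]
```

Under this framing, the paper's finding is that growing `n_prompts` (more distinct problems) helps more than growing `k_responses` (more solutions per problem) at a matched dataset size.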
Analysis of SFT-RL synergy and temperature selection guideline

The authors investigate how different SFT initializations affect final RL performance and establish that stronger SFT models lead to better outcomes when RL is conducted effectively. They provide a rule of thumb for setting sampling temperature to maintain temperature-adjusted entropy around 0.3 for effective exploration-exploitation balance.

9 retrieved papers
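A minimal sketch of how such a guideline could be applied in practice, assuming "temperature-adjusted entropy" means the entropy of the temperature-scaled softmax over logits (the paper's exact definition may differ, e.g. averaged over policy rollouts): since entropy increases monotonically with temperature for fixed logits, one can bisect for the temperature that lands near the 0.3 target.

```python
import numpy as np

def temperature_adjusted_entropy(logits: np.ndarray, temperature: float) -> float:
    """Entropy (in nats) of softmax(logits / temperature)."""
    z = logits / temperature
    z = z - z.max()                              # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def find_temperature(logits: np.ndarray, target_entropy: float = 0.3,
                     lo: float = 0.1, hi: float = 2.0, iters: int = 40) -> float:
    """Bisect for the temperature whose adjusted entropy hits the target,
    relying on entropy being monotonically increasing in temperature."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if temperature_adjusted_entropy(logits, mid) < target_entropy:
            lo = mid                             # too cold: raise temperature
        else:
            hi = mid                             # too hot: lower temperature
    return 0.5 * (lo + hi)
```

In an RL loop, the same idea would be applied per SFT initialization: measure entropy at the current sampling temperature and adjust toward the 0.3 target to balance exploration against exploitation.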
AceReason-Nemotron-1.1 7B model achieving state-of-the-art performance

The authors develop AceReason-Nemotron-1.1, a 7B parameter model that combines their strong SFT foundation with stage-wise RL training. The model achieves new state-of-the-art results among Qwen2.5-7B-based models on math and code benchmarks, validating their integrated post-training approach.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
