AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
Overview
Overall Novelty Assessment
This paper investigates how supervised fine-tuning (SFT) and reinforcement learning (RL) interact to produce strong reasoning models, focusing on data scaling strategies and temperature selection during RL training. It resides in the Mathematical and Code Reasoning leaf, which contains six papers addressing SFT-RL synergy in structured reasoning domains. This leaf sits within the broader Application Domains branch, indicating a moderately populated research direction where domain-specific methods are actively explored. The paper's emphasis on systematic scaling and temperature tuning positions it alongside works examining training orchestration and data quality in mathematical reasoning tasks.
The taxonomy reveals that Mathematical and Code Reasoning is one of several application-focused branches, with neighboring leaves covering Vision-Language Reasoning (16 papers across four sub-leaves) and Specialized Domain Applications (3 papers). The Integration Frameworks branch (12 papers across three leaves) explores general training paradigms, while Theoretical Foundations (9 papers across three leaves) examines mechanistic analyses and comparative studies. The paper's focus on practical training guidelines connects it to Sequential Training Strategies and Mechanistic Analysis leaves, though it remains grounded in mathematical reasoning applications rather than proposing domain-agnostic frameworks.
Among 29 candidates examined, no contributions were clearly refuted. The first contribution (SFT data scaling strategies) examined 10 candidates with zero refutable matches, suggesting limited prior work on systematic prompt-versus-response scaling comparisons. The second contribution (SFT-RL synergy and temperature selection) examined 9 candidates with no refutations, indicating that the specific temperature-entropy guideline (maintaining a temperature-adjusted entropy of roughly 0.3) may represent a novel empirical finding within the limited search scope. The third contribution (the AceReason-Nemotron model) examined 10 candidates with no overlaps, though model releases are inherently unique artifacts. These statistics reflect a top-K semantic search, not exhaustive coverage.
Based on the limited literature search, the work appears to offer incremental advances in understanding SFT-RL interactions for mathematical reasoning. The temperature selection guideline and scaling analysis provide practical insights, though the absence of refutations may partly reflect the search scope rather than absolute novelty. The taxonomy context shows this is an active but not overcrowded research direction, with room for empirical studies that bridge training paradigms and domain-specific applications.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors systematically explore two axes for scaling supervised fine-tuning data: increasing the number of unique prompts and increasing the number of responses per prompt. They find that scaling prompts yields more substantial gains than scaling responses per prompt, and observe consistent performance improvements across training epochs.
The authors investigate how different SFT initializations affect final RL performance and establish that stronger SFT models lead to better outcomes when RL is conducted effectively. They provide a rule of thumb for setting sampling temperature to maintain temperature-adjusted entropy around 0.3 for effective exploration-exploitation balance.
The authors develop AceReason-Nemotron-1.1, a 7B parameter model that combines their strong SFT foundation with stage-wise RL training. The model achieves new state-of-the-art results among Qwen2.5-7B-based models on math and code benchmarks, validating their integrated post-training approach.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] Teaching large language models to reason with reinforcement learning
[10] Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
[13] Reft: Reasoning with reinforced fine-tuning
[17] Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models
[28] G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning
Contribution Analysis
Detailed comparisons for each claimed contribution
Systematic investigation of SFT data scaling strategies
The authors systematically explore two axes for scaling supervised fine-tuning data: increasing the number of unique prompts and increasing the number of responses per prompt. They find that scaling prompts yields more substantial gains than scaling responses per prompt, and observe consistent performance improvements across training epochs.
[60] Entropic distribution matching for supervised fine-tuning of LLMs: Less overfitting and better diversity
[61] The best instruction-tuning data are those that fit
[62] Matching tasks to objectives: Fine-tuning and prompt-tuning strategies for encoder-decoder pre-trained language models
[63] A Self-Supervised Reinforcement Learning Approach for Fine-Tuning Large Language Models Using Cross-Attention Signals
[64] Don't Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner
[65] SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe
[66] Codeplan: Unlocking reasoning potential in large language models by scaling code-form planning
[67] Labeling supervised fine-tuning data with the scaling law
[68] Frontier AI From the Outside In: Advances in Data Curation, Data Distillation and Model Evaluation
[69] SLearnLLM: A Self-Learning Framework for Efficient Domain-Specific Adaptation of Large Language Models
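The two scaling axes compared in this contribution can be made concrete with a small sketch. This is an illustrative construction, not the authors' pipeline: `build_sft_set`, the prompt pool, and the `sample_response` stand-in for teacher-model generation are all hypothetical names introduced here. The point is that a fixed SFT budget (total examples) can be spent on either more unique prompts or more responses per prompt, which is the trade-off the paper studies.

```python
import random

def build_sft_set(prompt_pool, n_prompts, responses_per_prompt, sample_response):
    """Assemble an SFT dataset along the two scaling axes discussed above:
    the number of unique prompts, and the number of responses per prompt."""
    prompts = random.sample(prompt_pool, n_prompts)
    dataset = []
    for p in prompts:
        for _ in range(responses_per_prompt):
            dataset.append((p, sample_response(p)))
    return dataset

# Two datasets with the same total budget (1,000 examples), built along
# different axes. `gen` is a stand-in for sampling a teacher-model response.
pool = [f"problem-{i}" for i in range(10_000)]
gen = lambda p: f"solution-to-{p}"

scale_prompts = build_sft_set(pool, n_prompts=1000, responses_per_prompt=1,
                              sample_response=gen)
scale_responses = build_sft_set(pool, n_prompts=100, responses_per_prompt=10,
                                sample_response=gen)
assert len(scale_prompts) == len(scale_responses) == 1000
```

Under the paper's finding, the first configuration (more unique prompts) would be the better use of the same budget.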
Analysis of SFT-RL synergy and temperature selection guideline
The authors investigate how different SFT initializations affect final RL performance and establish that stronger SFT models lead to better outcomes when RL is conducted effectively. They provide a rule of thumb for setting sampling temperature to maintain temperature-adjusted entropy around 0.3 for effective exploration-exploitation balance.
[26] Beyond Two-Stage Training: Integrating SFT and RL for Improved Reasoning in LLMs
[70] Reinforcement Learning with Supervised Alignment
[71] MT: Scaling MLLM-based Text Image Machine Translation via Multi-Task Reinforcement Learning
[72] Technical Framework for Engagement-Optimized Short Text Generation in Digital Commerce Using Large Language Models and Reinforcement Learning
[73] A Llama walks into the 'Bar': Efficient Supervised Fine-Tuning for Legal Reasoning in the Multi-state Bar Exam
[74] Probing the Origins of Reasoning Performance: Representational Quality for Mathematical Problem-Solving in RL vs SFT Finetuned Models
[75] Thyme: Think Beyond Images
[76] Exploration Strategies for Reasoning Fine-tuning
[77] REINFORCEMENT LEARNING
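The temperature-entropy guideline in this contribution can be illustrated with a short sketch. The exact definition of "temperature-adjusted entropy" in the paper may differ; here it is read as the Shannon entropy of the temperature-scaled softmax policy, and `entropy` and `pick_temperature` are helper names introduced for illustration only. Because entropy increases monotonically with temperature for fixed logits, a simple bisection finds the temperature whose policy entropy lands near the 0.3 target.

```python
import math

def entropy(logits, temperature):
    """Shannon entropy (in nats) of softmax(logits / T) at one token position."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_temperature(logits, target=0.3, lo=0.1, hi=2.0, iters=40):
    """Bisect for the temperature whose policy entropy is near the target,
    exploiting the monotone entropy-temperature relationship."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if entropy(logits, mid) < target:
            lo = mid  # too cold: distribution too peaked, raise temperature
        else:
            hi = mid  # too hot: distribution too flat, lower temperature
    return (lo + hi) / 2
```

In practice one would average the entropy over many token positions sampled from the model rather than a single logit vector, but the calibration logic is the same.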
AceReason-Nemotron-1.1 7B model achieving state-of-the-art performance
The authors develop AceReason-Nemotron-1.1, a 7B parameter model that combines their strong SFT foundation with stage-wise RL training. The model achieves new state-of-the-art results among Qwen2.5-7B-based models on math and code benchmarks, validating their integrated post-training approach.