CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Reinforcement learning, LLM reasoning, Curriculum learning
Abstract:

Curriculum learning plays a crucial role in improving the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to account for variations in prompt difficulty, or rely on simplistic filtering mechanisms that select prompts within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement-learning gradient optimization, offering a systematic theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of rollout quantities influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that CurES outperforms Group Relative Policy Optimization (GRPO) by +3.3 points and +4.82 points with 1.5B and 7B models, respectively, and exceeds the best prior sample-efficient methods by +2.12 points on average across eight math reasoning benchmarks. CurES also converges faster than baselines such as GRPO.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CurES, a curriculum learning method that optimizes training efficiency for reasoning LLMs by jointly addressing prompt selection and rollout allocation. It resides in the Difficulty-Based Curriculum Scheduling leaf, which contains seven papers including CurES itself. This leaf sits within the broader Curriculum Design and Optimization Strategies branch, indicating a moderately populated research direction focused on ordering training samples by difficulty. The taxonomy reveals that difficulty-based scheduling is one of four sibling approaches under curriculum design, suggesting this is an established but not overcrowded area with clear methodological boundaries.

The taxonomy structure shows that CurES's immediate neighbors include Adaptive Sample Selection and Allocation (five papers) and Progressive Multi-Stage Training Frameworks (four papers), both addressing related but distinct aspects of curriculum design. The Adaptive Sample Selection leaf focuses on dynamic resource allocation without fixed difficulty ordering, while Progressive Multi-Stage emphasizes phased training pipelines. CurES bridges these directions by combining difficulty-based scheduling with adaptive rollout allocation, positioning it at the intersection of static curriculum design and dynamic resource management. The exclude_note for Adaptive Sample Selection explicitly separates it from fixed difficulty methods, clarifying that CurES's difficulty-based foundation distinguishes it from purely adaptive approaches.

Among the three contributions analyzed, the theoretical analysis linking gradient efficiency to prompt difficulty examined four candidates with zero refutations, suggesting this framing may be relatively novel within the limited search scope. The CurES method itself examined ten candidates without clear refutation, indicating potential novelty in its specific combination of Bayesian estimation and curriculum scheduling. However, the optimal sampling distribution and rollout allocation formulas examined four candidates and found two refutable cases, suggesting this contribution has more substantial prior work. The analysis explicitly notes that only eighteen total candidates were examined across all contributions, meaning these findings reflect a targeted semantic search rather than exhaustive coverage.

Based on the limited search scope of eighteen candidates, the work appears to offer incremental advances in difficulty-based curriculum scheduling, particularly in its theoretical framing and Bayesian estimation approach. The presence of two refutable cases for the allocation formulas suggests that some core ideas have precedent, though the specific integration may differ. The taxonomy context indicates this is an active but not saturated research direction, with CurES contributing to ongoing efforts to formalize and optimize curriculum design for reasoning tasks.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 2

Research Landscape Overview

Core task: curriculum learning for training reasoning large language models. The field organizes itself around several complementary directions. Curriculum Design and Optimization Strategies focus on how to sequence training data by difficulty or other criteria, ensuring models progress from simpler to more complex reasoning tasks. Reinforcement Learning Approaches leverage reward signals and policy optimization to refine reasoning behavior, while Inference-Time Reasoning Enhancement explores methods that improve reasoning during deployment rather than training. Multimodal Reasoning and Vision-Language Integration extends these ideas to settings where models must reason over images and text together, as seen in works like Vision-R1[2] and Insight-V[4].

Knowledge Distillation and Data Synthesis address how to generate or transfer reasoning capabilities efficiently, and Cross-Domain and Auxiliary Training Strategies examine the use of code, tool use, or other auxiliary tasks to bootstrap reasoning skills. Theoretical Foundations and Survey Studies, such as Survey LLM Reasoning[40], provide overarching perspectives, while Specialized Application Domains and Architectural Innovations tackle domain-specific challenges and efficiency concerns.

Within Curriculum Design and Optimization Strategies, a particularly active line of work centers on difficulty-based scheduling, where training examples are ordered to match the learner's evolving capacity. CurES[0] sits squarely in this branch, proposing a curriculum that adapts example difficulty dynamically during training. Nearby, Curriculum Commonsense[5] and Curriculum Easy to Hard[14] explore similar principles of progressive difficulty, though they may differ in how difficulty is measured or how transitions are triggered. Another closely related direction is Difficulty-Aware Staged[25], which also emphasizes staged progression but may incorporate different heuristics for stage boundaries.
The central tension across these works is how to define and estimate difficulty reliably, and whether to use fixed schedules or adaptive mechanisms that respond to model performance. CurES[0] contributes to this conversation by offering a specific strategy for evolving the curriculum in response to learning signals, positioning itself among methods that prioritize dynamic adjustment over static ordering.

Claimed Contributions

Theoretical analysis linking gradient efficiency to prompt difficulty and rollout allocation

The authors establish a theoretical framework showing that the sampling distribution of prompts dictates the convergence rate of gradient descent, while rollout quantity allocation influences gradient update consistency and stability. This analysis reveals that prompt difficulty, measured by model accuracy, caps optimization potential.

4 retrieved papers
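One concrete way to read the claim that difficulty "caps optimization potential" (a plausible instantiation under a binary-reward assumption, not necessarily the paper's exact derivation) is through the variance of rollout rewards: for a prompt with pass rate p, binary rewards have variance p(1-p), so the signal available to a group-normalized policy gradient vanishes for prompts the model always solves or always fails.

```latex
% Binary rollout reward r \sim \mathrm{Bernoulli}(p) for a prompt with pass rate p:
\operatorname{Var}[r] \;=\; \mathbb{E}[r^2] - \mathbb{E}[r]^2 \;=\; p - p^2 \;=\; p(1-p) \;\le\; \tfrac{1}{4}
% The maximum is attained at p = 1/2: prompts of intermediate difficulty
% carry the most gradient signal, while p -> 0 or p -> 1 yields a
% (near-)zero advantage and wasted rollouts.
```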
CurES training method with Bayesian posterior estimation

The authors propose CurES, a curriculum learning method that estimates prompt difficulty via question-answering accuracy, then reallocates prompt sampling probabilities and rollout quantities accordingly. The method uses Bayesian posterior estimation to progressively refine confidence in accuracy estimates using historical data, minimizing computational overhead while improving training robustness.

10 retrieved papers
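The Bayesian posterior estimation described above can be sketched with a standard Beta–Bernoulli update, where each rollout's pass/fail outcome refines a per-prompt accuracy estimate. This is a minimal illustration under the assumption that rollout correctness is binary; the class name and the uniform Beta(1,1) prior are illustrative choices, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class PromptDifficulty:
    """Beta-Bernoulli posterior over a prompt's pass rate."""
    alpha: float = 1.0  # prior pseudo-count of successes (Beta(1,1) = uniform)
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, num_correct: int, num_rollouts: int) -> None:
        # Conjugate update: add observed successes/failures to the counts,
        # so historical rollouts accumulate across training iterations.
        self.alpha += num_correct
        self.beta += num_rollouts - num_correct

    @property
    def mean_accuracy(self) -> float:
        # Posterior mean estimate of the prompt's pass rate.
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Posterior variance: shrinks as more rollouts accumulate,
        # i.e. confidence in the difficulty estimate grows over time.
        n = self.alpha + self.beta
        return (self.alpha * self.beta) / (n * n * (n + 1))

# Example: a prompt that passed 3 of 8 rollouts in the latest batch.
est = PromptDifficulty()
est.update(num_correct=3, num_rollouts=8)
```

Because the update is conjugate, it costs only two additions per prompt per batch, which is consistent with the report's emphasis on minimizing computational overhead.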
Optimal sampling distribution and rollout allocation formulas

The authors derive closed-form solutions for optimal prompt sampling distribution under entropy maximization constraints and optimal rollout quantity allocation that minimizes gradient variance. These formulas directly guide the practical implementation of CurES by connecting theoretical bounds to actionable training strategies.

4 retrieved papers · Can Refute
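As a rough sketch of what such closed forms can look like (an analogy, not the paper's actual derivation): an entropy-regularized objective over per-prompt utilities yields a softmax sampling distribution, and splitting a fixed rollout budget in proportion to each prompt's reward standard deviation sqrt(p(1-p)) is the classic variance-minimizing (Neyman-style) allocation. All function names and the temperature parameter are illustrative.

```python
import math

def sampling_distribution(utilities, temperature=1.0):
    """Softmax over per-prompt utilities: the maximum-entropy distribution
    subject to a constraint on expected utility."""
    scaled = [u / temperature for u in utilities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def rollout_allocation(pass_rates, budget):
    """Split a total rollout budget across prompts in proportion to the
    standard deviation of their binary rewards (Neyman-style allocation,
    which minimizes the variance of the aggregated estimate)."""
    sigmas = [math.sqrt(p * (1.0 - p)) for p in pass_rates]
    total = sum(sigmas) or 1.0  # guard against all-solved/all-failed prompts
    return [budget * s / total for s in sigmas]

# Example: three prompts of easy / medium / hard difficulty.
probs = sampling_distribution([0.2, 0.5, 0.1], temperature=0.5)
alloc = rollout_allocation([0.9, 0.5, 0.1], budget=64)
```

Under this sketch, the medium-difficulty prompt (pass rate 0.5) receives the largest rollout share, matching the intuition that intermediate difficulty carries the most gradient signal.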

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Theoretical analysis linking gradient efficiency to prompt difficulty and rollout allocation

Contribution 2: CurES training method with Bayesian posterior estimation

Contribution 3: Optimal sampling distribution and rollout allocation formulas