CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Reinforcement learning, LLM reasoning, Curriculum learning
Abstract:

Curriculum learning plays a crucial role in improving the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to account for variations in prompt difficulty, or rely on simplistic filtering mechanisms that select prompts within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement-learning gradient optimization, offering a systematic theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of rollout quantities influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that CurES outperforms Group Relative Policy Optimization (GRPO) by +3.3 points and +4.82 points with 1.5B and 7B models, respectively, and exceeds the best prior sample-efficient methods by +2.12 points on average across eight math reasoning benchmarks. CurES also converges faster than baselines such as GRPO.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CurES, a curriculum learning method that optimizes training efficiency for reasoning LLMs by jointly addressing prompt selection and rollout allocation. It resides in the Difficulty-Based Curriculum Scheduling leaf, which contains seven papers including CurES itself. This leaf sits within the broader Curriculum Design and Optimization Strategies branch, indicating a moderately populated research direction focused on ordering training samples by difficulty. The taxonomy reveals that difficulty-based scheduling is one of four sibling approaches under curriculum design, suggesting this is an established but not overcrowded area with clear methodological boundaries.

The taxonomy structure shows that CurES's immediate neighbors include Adaptive Sample Selection and Allocation (five papers) and Progressive Multi-Stage Training Frameworks (four papers), both addressing related but distinct aspects of curriculum design. The Adaptive Sample Selection leaf focuses on dynamic resource allocation without fixed difficulty ordering, while Progressive Multi-Stage emphasizes phased training pipelines. CurES bridges these directions by combining difficulty-based scheduling with adaptive rollout allocation, positioning it at the intersection of static curriculum design and dynamic resource management. The exclude_note for Adaptive Sample Selection explicitly separates it from fixed difficulty methods, clarifying that CurES's difficulty-based foundation distinguishes it from purely adaptive approaches.

Among the three contributions analyzed, the theoretical analysis linking gradient efficiency to prompt difficulty examined four candidates with zero refutations, suggesting this framing may be relatively novel within the limited search scope. The CurES method itself examined ten candidates without clear refutation, indicating potential novelty in its specific combination of Bayesian estimation and curriculum scheduling. However, the optimal sampling distribution and rollout allocation formulas examined four candidates and found two refutable cases, suggesting this contribution has more substantial prior work. The analysis explicitly notes that only eighteen total candidates were examined across all contributions, meaning these findings reflect a targeted semantic search rather than exhaustive coverage.

Based on the limited search scope of eighteen candidates, the work appears to offer incremental advances in difficulty-based curriculum scheduling, particularly in its theoretical framing and Bayesian estimation approach. The presence of two refutable cases for the allocation formulas suggests that some core ideas have precedent, though the specific integration may differ. The taxonomy context indicates this is an active but not saturated research direction, with CurES contributing to ongoing efforts to formalize and optimize curriculum design for reasoning tasks.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 2

Research Landscape Overview

Core task: curriculum learning for training reasoning large language models. The field organizes itself around several complementary directions. Curriculum Design and Optimization Strategies focus on how to sequence training data by difficulty or other criteria, ensuring models progress from simpler to more complex reasoning tasks. Reinforcement Learning Approaches leverage reward signals and policy optimization to refine reasoning behavior, while Inference-Time Reasoning Enhancement explores methods that improve reasoning during deployment rather than training. Multimodal Reasoning and Vision-Language Integration extends these ideas to settings where models must reason over images and text together, as seen in works like Vision-R1[2] and Insight-V[4].

Knowledge Distillation and Data Synthesis address how to generate or transfer reasoning capabilities efficiently, and Cross-Domain and Auxiliary Training Strategies examine the use of code, tool use, or other auxiliary tasks to bootstrap reasoning skills. Theoretical Foundations and Survey Studies, such as Survey LLM Reasoning[40], provide overarching perspectives, while Specialized Application Domains and Architectural Innovations tackle domain-specific challenges and efficiency concerns.

Within Curriculum Design and Optimization Strategies, a particularly active line of work centers on difficulty-based scheduling, where training examples are ordered to match the learner's evolving capacity. CurES[0] sits squarely in this branch, proposing a curriculum that adapts example difficulty dynamically during training. Nearby, Curriculum Commonsense[5] and Curriculum Easy to Hard[14] explore similar principles of progressive difficulty, though they may differ in how difficulty is measured or how transitions are triggered. Another closely related direction is Difficulty-Aware Staged[25], which also emphasizes staged progression but may incorporate different heuristics for stage boundaries.
The central tension across these works is how to define and estimate difficulty reliably, and whether to use fixed schedules or adaptive mechanisms that respond to model performance. CurES[0] contributes to this conversation by offering a specific strategy for evolving the curriculum in response to learning signals, positioning itself among methods that prioritize dynamic adjustment over static ordering.

Claimed Contributions

Theoretical analysis linking gradient efficiency to prompt difficulty and rollout allocation

The authors establish a theoretical framework showing that the sampling distribution of prompts dictates the convergence rate of gradient descent, while rollout quantity allocation influences gradient update consistency and stability. This analysis reveals that prompt difficulty, measured by model accuracy, caps optimization potential.

4 retrieved papers
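One concrete way to read the claim that difficulty "caps optimization potential" (a plausible instantiation under a binary-reward assumption, not necessarily the paper's exact derivation) is through the variance of rollout rewards: for a prompt with pass rate p, binary rewards have variance p(1-p), so the signal available to a group-normalized policy gradient vanishes for prompts the model always solves or always fails.

```latex
% Binary rollout reward r \sim \mathrm{Bernoulli}(p) for a prompt with pass rate p:
\operatorname{Var}[r] \;=\; \mathbb{E}[r^2] - \mathbb{E}[r]^2 \;=\; p - p^2 \;=\; p(1-p) \;\le\; \tfrac{1}{4}
% The maximum is attained at p = 1/2: prompts of intermediate difficulty
% carry the most gradient signal, while p -> 0 or p -> 1 yields a
% (near-)zero advantage and wasted rollouts.
```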
CurES training method with Bayesian posterior estimation

The authors propose CurES, a curriculum learning method that estimates prompt difficulty via question-answering accuracy, then reallocates prompt sampling probabilities and rollout quantities accordingly. The method uses Bayesian posterior estimation to progressively refine confidence in accuracy estimates using historical data, minimizing computational overhead while improving training robustness.

10 retrieved papers
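The Bayesian posterior estimation described above can be sketched with a standard Beta–Bernoulli update, where each rollout's pass/fail outcome refines a per-prompt accuracy estimate. This is a minimal illustration under the assumption that rollout correctness is binary; the class name and the uniform Beta(1,1) prior are illustrative choices, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class PromptDifficulty:
    """Beta-Bernoulli posterior over a prompt's pass rate."""
    alpha: float = 1.0  # prior pseudo-count of successes (Beta(1,1) = uniform)
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, num_correct: int, num_rollouts: int) -> None:
        # Conjugate update: add observed successes/failures to the counts,
        # so historical rollouts accumulate across training iterations.
        self.alpha += num_correct
        self.beta += num_rollouts - num_correct

    @property
    def mean_accuracy(self) -> float:
        # Posterior mean estimate of the prompt's pass rate.
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Posterior variance: shrinks as more rollouts accumulate,
        # i.e. confidence in the difficulty estimate grows over time.
        n = self.alpha + self.beta
        return (self.alpha * self.beta) / (n * n * (n + 1))

# Example: a prompt that passed 3 of 8 rollouts in the latest batch.
est = PromptDifficulty()
est.update(num_correct=3, num_rollouts=8)
```

Because the update is conjugate, it costs only two additions per prompt per batch, which is consistent with the report's emphasis on minimizing computational overhead.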
Optimal sampling distribution and rollout allocation formulas

The authors derive closed-form solutions for optimal prompt sampling distribution under entropy maximization constraints and optimal rollout quantity allocation that minimizes gradient variance. These formulas directly guide the practical implementation of CurES by connecting theoretical bounds to actionable training strategies.

4 retrieved papers · Can Refute
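As a rough sketch of what such closed forms can look like (an analogy, not the paper's actual derivation): an entropy-regularized objective over per-prompt utilities yields a softmax sampling distribution, and splitting a fixed rollout budget in proportion to each prompt's reward standard deviation sqrt(p(1-p)) is the classic variance-minimizing (Neyman-style) allocation. All function names and the temperature parameter are illustrative.

```python
import math

def sampling_distribution(utilities, temperature=1.0):
    """Softmax over per-prompt utilities: the maximum-entropy distribution
    subject to a constraint on expected utility."""
    scaled = [u / temperature for u in utilities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def rollout_allocation(pass_rates, budget):
    """Split a total rollout budget across prompts in proportion to the
    standard deviation of their binary rewards (Neyman-style allocation,
    which minimizes the variance of the aggregated estimate)."""
    sigmas = [math.sqrt(p * (1.0 - p)) for p in pass_rates]
    total = sum(sigmas) or 1.0  # guard against all-solved/all-failed prompts
    return [budget * s / total for s in sigmas]

# Example: three prompts of easy / medium / hard difficulty.
probs = sampling_distribution([0.2, 0.5, 0.1], temperature=0.5)
alloc = rollout_allocation([0.9, 0.5, 0.1], budget=64)
```

Under this sketch, the medium-difficulty prompt (pass rate 0.5) receives the largest rollout share, matching the intuition that intermediate difficulty carries the most gradient signal.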

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Theoretical analysis linking gradient efficiency to prompt difficulty and rollout allocation

Contribution 2: CurES training method with Bayesian posterior estimation

Contribution 3: Optimal sampling distribution and rollout allocation formulas