Prompt Curriculum Learning for Efficient LLM Post-Training

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: reinforcement learning, large language models, post-training, curriculum learning
Abstract:

Reinforcement learning (RL) is widely used to post-train large language models for tasks such as mathematical reasoning and coding. However, the convergence of RL training remains sensitive to batching and prompt selection strategies. We investigate the factors that affect convergence, including batch size and prompt difficulty. Through large-scale experiments across multiple models and datasets, we show that there exists an optimal batch size that balances generation time and gradient quality, and that prompts of intermediate difficulty (where the model has roughly a 50% chance of success) are the most sample-efficient for model convergence. Motivated by these findings, we propose Prompt Curriculum Learning (PCL), a lightweight algorithm that selects intermediate-difficulty prompts using a learned value model. PCL avoids costly rollouts and efficiently guides training by focusing on the most informative samples. Empirically, PCL either achieves the highest performance or requires significantly less training time to reach comparable performance across a suite of benchmarks. Compared to rollout-based filtering, PCL is 12.1× and 16.9× faster at identifying intermediate-difficulty prompts when training on MATH and DeepScaleR, respectively.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Prompt Curriculum Learning (PCL), a method that selects intermediate-difficulty prompts using a learned value model to improve sample efficiency in RL post-training for LLMs. It resides in the 'Curriculum and Progressive Learning' leaf, which contains four papers total (including this one). This leaf sits within the broader 'Sample Efficiency and Data Optimization' branch, indicating a moderately populated research direction focused on data-centric efficiency strategies rather than algorithmic or system-level innovations.

The taxonomy reveals that curriculum-based methods occupy one of three sibling categories under sample efficiency, alongside 'Experience Replay and Off-Policy Methods' (four papers) and 'Data Pruning and Selection' (two papers). Neighboring branches address complementary efficiency concerns: 'System-Level Efficiency' tackles distributed training infrastructure, while 'Parameter-Efficient and Sparse Training' explores low-rank adapters and subnetwork approaches. PCL's focus on prompt difficulty sequencing distinguishes it from replay-based methods that reuse past trajectories and from static data selection techniques that filter without temporal ordering.

Among 23 candidates examined across three contributions, none clearly refute the paper's claims. The systematic investigation of batch size and prompt difficulty examined 10 candidates with zero refutations, suggesting limited prior work explicitly studying these convergence factors together. The PCL algorithm itself examined 3 candidates without refutation, and the value-model-based filtering approach examined 10 candidates, also without refutation. These statistics indicate that within the limited search scope, the specific combination of intermediate-difficulty prompt selection via learned value models appears relatively unexplored, though the broader curriculum learning paradigm is established.

Given the search examined 23 candidates from top-K semantic matches, the analysis captures closely related work but cannot claim exhaustive coverage. The absence of refutations suggests novelty within this scope, yet the moderately populated taxonomy leaf (four papers) indicates the curriculum learning direction is active. The paper's contribution appears to lie in operationalizing intermediate-difficulty selection through value models rather than rollouts, a practical refinement within an established paradigm rather than a fundamentally new research direction.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: Efficient reinforcement learning for large language model post-training. The field has evolved into a rich taxonomy addressing diverse bottlenecks in aligning and refining LLMs through RL. At the highest level, the taxonomy organizes work into branches such as RL Algorithm Design and Optimization (exploring novel policy gradient methods and reward modeling techniques), Sample Efficiency and Data Optimization (focusing on curriculum learning, replay mechanisms, and data selection strategies), System-Level Efficiency and Scalability (tackling distributed training and memory management), Parameter-Efficient and Sparse Training (leveraging low-rank adapters and subnetwork approaches like RL Finetunes Subnetworks[3]), Reasoning and Task-Specific Applications (enhancing chain-of-thought and tool use), Training Dynamics and Theoretical Analysis (studying convergence and reward hacking), Continual Learning and Adaptation (mitigating catastrophic forgetting), Specialized Applications and Domains (from biomedical to interactive environments), and Empirical Studies and Comparative Analysis (benchmarking RLHF variants such as RLAIF vs RLHF[7] and Robust RLHF[1]).

This structure reflects the community's recognition that post-training efficiency is multifaceted, requiring advances in algorithms, data usage, system design, and domain adaptation. Within Sample Efficiency and Data Optimization, a particularly active line of work explores curriculum and progressive learning strategies that adaptively sequence training examples to accelerate convergence and reduce sample complexity. Prompt Curriculum Learning[0] exemplifies this direction by systematically ordering prompts to guide the model through increasingly challenging scenarios, akin to pedagogical principles in human education.
Nearby efforts such as Reverse Curriculum RL[46] and Sample-Centric Progressive Optimization[50] similarly manipulate task difficulty or sample presentation order, yet differ in whether they start from hard examples and work backward or dynamically adjust based on model performance. These approaches contrast with methods in other branches—such as Semantic Token Entropy[37] or Sample-Efficient RLHF[11]—that prioritize data quality or intrinsic exploration over explicit sequencing. Prompt Curriculum Learning[0] sits naturally among these curriculum-driven works, emphasizing structured progression as a lever for sample efficiency, while complementing broader efforts like Post-Training Scaling Survey[5] that examine how different efficiency techniques scale with model size and compute budgets.

Claimed Contributions

Systematic investigation of batch size and prompt difficulty effects on RL convergence

The authors conduct large-scale experiments to identify an optimal batch size that balances generation time and gradient quality, and demonstrate that prompts of intermediate difficulty (approximately 50% success rate) are most sample-efficient for convergence in RL-based LLM post-training.

10 retrieved papers
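One way to see why intermediate-difficulty prompts carry the most signal is through the variance of a binary reward: a prompt the model solves with probability p yields a Bernoulli reward with variance p(1 − p), which peaks at p = 0.5 and vanishes at p = 0 or p = 1, where (under group-normalized objectives) all rollouts receive identical rewards and the advantage collapses to zero. A minimal sketch of this intuition (our own illustration, not the paper's code):

```python
# Variance of a {0, 1} reward as a function of the model's per-prompt
# success rate p. Prompts the model always solves (p = 1) or always
# fails (p = 0) yield zero-variance rewards and hence no learning
# signal; the variance is maximized at p = 0.5.

def reward_variance(p: float) -> float:
    """Variance of a Bernoulli reward with success probability p."""
    return p * (1.0 - p)

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p = {p:.1f}  ->  Var = {reward_variance(p):.2f}")
```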
Prompt Curriculum Learning (PCL) algorithm

The authors introduce PCL, a lightweight algorithm that uses a learned value model to select intermediate-difficulty prompts via single forward passes, avoiding expensive rollout-based filtering while efficiently focusing training on informative samples.

3 retrieved papers
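The selection step described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `value_model` stands in for the learned value model and is assumed to map a prompt to the predicted probability that the current policy solves it.

```python
# Hypothetical sketch of value-model-based prompt selection: score each
# candidate prompt with a single forward pass and keep those whose
# predicted success rate is closest to the 0.5 sweet spot.

def select_intermediate_prompts(prompts, value_model, target=0.5, k=32):
    """Return the k prompts whose predicted success probability is
    nearest to `target` (intermediate difficulty)."""
    scored = [(abs(value_model(p) - target), p) for p in prompts]
    scored.sort(key=lambda pair: pair[0])
    return [p for _, p in scored[:k]]

# Toy usage with a stand-in "value model" backed by a lookup table.
fake_scores = {"easy": 0.95, "hard": 0.05, "mid1": 0.55, "mid2": 0.45}
picked = select_intermediate_prompts(
    list(fake_scores), fake_scores.get, target=0.5, k=2
)
```

Here the too-easy and too-hard prompts are discarded and the two near-50% prompts are kept; in the actual method the scores would come from value-model forward passes rather than a table.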
Efficient value-model-based prompt filtering approach

The authors develop an efficient prompt filtering method that uses a value model to estimate prompt difficulty with single forward passes, achieving 12.1× and 16.9× speedups compared to rollout-based filtering on MATH and DeepScaleR respectively.

10 retrieved papers
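As a rough intuition for where the speedup comes from (illustrative numbers, not the paper's measurements): rollout-based filtering pays one forward pass per generated token per rollout, while a value model pays a single forward pass per prompt. The toy ratio below counts raw forward passes and therefore overstates the gain; the measured 12.1× and 16.9× wall-clock speedups also depend on prompt length, batching, and hardware.

```python
# Back-of-the-envelope forward-pass count comparison between
# rollout-based filtering and value-model filtering. All numbers
# (rollouts per prompt, response length) are illustrative.

def rollout_cost(num_prompts: int, rollouts_per_prompt: int,
                 response_tokens: int) -> int:
    # Autoregressive generation: one forward pass per generated token.
    return num_prompts * rollouts_per_prompt * response_tokens

def value_model_cost(num_prompts: int) -> int:
    # One forward pass per prompt, no token-by-token generation.
    return num_prompts

n = 1024
speedup = rollout_cost(n, 8, 512) / value_model_cost(n)
print(f"illustrative forward-pass ratio: {speedup:.0f}x")
```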

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic investigation of batch size and prompt difficulty effects on RL convergence

The authors conduct large-scale experiments to identify an optimal batch size that balances generation time and gradient quality, and demonstrate that prompts of intermediate difficulty (approximately 50% success rate) are most sample-efficient for convergence in RL-based LLM post-training.

Contribution

Prompt Curriculum Learning (PCL) algorithm

The authors introduce PCL, a lightweight algorithm that uses a learned value model to select intermediate-difficulty prompts via single forward passes, avoiding expensive rollout-based filtering while efficiently focusing training on informative samples.

Contribution

Efficient value-model-based prompt filtering approach

The authors develop an efficient prompt filtering method that uses a value model to estimate prompt difficulty with single forward passes, achieving 12.1× and 16.9× speedups compared to rollout-based filtering on MATH and DeepScaleR respectively.
