Prompt Curriculum Learning for Efficient LLM Post-Training

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: reinforcement learning, large language models, post-training, curriculum learning
Abstract:

Reinforcement learning (RL) is widely used to post-train large language models for tasks such as mathematical reasoning and coding. However, the convergence of RL training remains sensitive to batching and prompt selection strategies. We investigate the factors that affect convergence, including batch size and prompt difficulty. Through large-scale experiments across multiple models and datasets, we show that there exists an optimal batch size that balances generation time and gradient quality, and that prompts of intermediate difficulty (where the model has roughly a 50% chance of success) are the most sample-efficient for model convergence. Motivated by these findings, we propose Prompt Curriculum Learning (PCL), a lightweight algorithm that selects intermediate-difficulty prompts using a learned value model. PCL avoids costly rollouts and efficiently guides training by focusing on the most informative samples. Empirically, PCL either achieves the highest performance or requires significantly less training time to reach comparable performance across a suite of benchmarks. Compared to rollout-based filtering, PCL is 12.1× and 16.9× faster at identifying intermediate-difficulty prompts when training on MATH and DeepScaleR, respectively.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Prompt Curriculum Learning (PCL), a method that selects intermediate-difficulty prompts using a learned value model to improve sample efficiency in RL post-training for LLMs. It resides in the 'Curriculum and Progressive Learning' leaf, which contains four papers total (including this one). This leaf sits within the broader 'Sample Efficiency and Data Optimization' branch, indicating a moderately populated research direction focused on data-centric efficiency strategies rather than algorithmic or system-level innovations.

The taxonomy reveals that curriculum-based methods occupy one of three sibling categories under sample efficiency, alongside 'Experience Replay and Off-Policy Methods' (four papers) and 'Data Pruning and Selection' (two papers). Neighboring branches address complementary efficiency concerns: 'System-Level Efficiency' tackles distributed training infrastructure, while 'Parameter-Efficient and Sparse Training' explores low-rank adapters and subnetwork approaches. PCL's focus on prompt difficulty sequencing distinguishes it from replay-based methods that reuse past trajectories and from static data selection techniques that filter without temporal ordering.

Among 23 candidates examined across three contributions, none clearly refute the paper's claims. The systematic investigation of batch size and prompt difficulty examined 10 candidates with zero refutations, suggesting limited prior work explicitly studying these convergence factors together. The PCL algorithm itself examined 3 candidates without refutation, and the value-model-based filtering approach examined 10 candidates, also without refutation. These statistics indicate that within the limited search scope, the specific combination of intermediate-difficulty prompt selection via learned value models appears relatively unexplored, though the broader curriculum learning paradigm is established.

Given the search examined 23 candidates from top-K semantic matches, the analysis captures closely related work but cannot claim exhaustive coverage. The absence of refutations suggests novelty within this scope, yet the moderately populated taxonomy leaf (four papers) indicates the curriculum learning direction is active. The paper's contribution appears to lie in operationalizing intermediate-difficulty selection through value models rather than rollouts, a practical refinement within an established paradigm rather than a fundamentally new research direction.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: Efficient reinforcement learning for large language model post-training. The field has evolved into a rich taxonomy addressing diverse bottlenecks in aligning and refining LLMs through RL. At the highest level, the taxonomy organizes work into branches such as RL Algorithm Design and Optimization (exploring novel policy gradient methods and reward modeling techniques), Sample Efficiency and Data Optimization (focusing on curriculum learning, replay mechanisms, and data selection strategies), System-Level Efficiency and Scalability (tackling distributed training and memory management), Parameter-Efficient and Sparse Training (leveraging low-rank adapters and subnetwork approaches like RL Finetunes Subnetworks[3]), Reasoning and Task-Specific Applications (enhancing chain-of-thought and tool use), Training Dynamics and Theoretical Analysis (studying convergence and reward hacking), Continual Learning and Adaptation (mitigating catastrophic forgetting), Specialized Applications and Domains (from biomedical to interactive environments), and Empirical Studies and Comparative Analysis (benchmarking RLHF variants such as RLAIF vs RLHF[7] and Robust RLHF[1]).

This structure reflects the community's recognition that post-training efficiency is multifaceted, requiring advances in algorithms, data usage, system design, and domain adaptation. Within Sample Efficiency and Data Optimization, a particularly active line of work explores curriculum and progressive learning strategies that adaptively sequence training examples to accelerate convergence and reduce sample complexity. Prompt Curriculum Learning[0] exemplifies this direction by systematically ordering prompts to guide the model through increasingly challenging scenarios, akin to pedagogical principles in human education.
Nearby efforts such as Reverse Curriculum RL[46] and Sample-Centric Progressive Optimization[50] similarly manipulate task difficulty or sample presentation order, yet differ in whether they start from hard examples and work backward or dynamically adjust based on model performance. These approaches contrast with methods in other branches—such as Semantic Token Entropy[37] or Sample-Efficient RLHF[11]—that prioritize data quality or intrinsic exploration over explicit sequencing. Prompt Curriculum Learning[0] sits naturally among these curriculum-driven works, emphasizing structured progression as a lever for sample efficiency, while complementing broader efforts like Post-Training Scaling Survey[5] that examine how different efficiency techniques scale with model size and compute budgets.

Claimed Contributions

Systematic investigation of batch size and prompt difficulty effects on RL convergence

The authors conduct large-scale experiments to identify an optimal batch size that balances generation time and gradient quality, and demonstrate that prompts of intermediate difficulty (approximately 50% success rate) are most sample-efficient for convergence in RL-based LLM post-training.

10 retrieved papers
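One way to see why intermediate-difficulty prompts carry the most signal is through the variance of a binary reward: a prompt the model solves with probability p yields a Bernoulli reward with variance p(1 − p), which peaks at p = 0.5 and vanishes at p = 0 or p = 1, where (under group-normalized objectives) all rollouts receive identical rewards and the advantage collapses to zero. A minimal sketch of this intuition (our own illustration, not the paper's code):

```python
# Variance of a {0, 1} reward as a function of the model's per-prompt
# success rate p. Prompts the model always solves (p = 1) or always
# fails (p = 0) yield zero-variance rewards and hence no learning
# signal; the variance is maximized at p = 0.5.

def reward_variance(p: float) -> float:
    """Variance of a Bernoulli reward with success probability p."""
    return p * (1.0 - p)

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p = {p:.1f}  ->  Var = {reward_variance(p):.2f}")
```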
Prompt Curriculum Learning (PCL) algorithm

The authors introduce PCL, a lightweight algorithm that uses a learned value model to select intermediate-difficulty prompts via single forward passes, avoiding expensive rollout-based filtering while efficiently focusing training on informative samples.

3 retrieved papers
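The selection step described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `value_model` stands in for the learned value model and is assumed to map a prompt to the predicted probability that the current policy solves it.

```python
# Hypothetical sketch of value-model-based prompt selection: score each
# candidate prompt with a single forward pass and keep those whose
# predicted success rate is closest to the 0.5 sweet spot.

def select_intermediate_prompts(prompts, value_model, target=0.5, k=32):
    """Return the k prompts whose predicted success probability is
    nearest to `target` (intermediate difficulty)."""
    scored = [(abs(value_model(p) - target), p) for p in prompts]
    scored.sort(key=lambda pair: pair[0])
    return [p for _, p in scored[:k]]

# Toy usage with a stand-in "value model" backed by a lookup table.
fake_scores = {"easy": 0.95, "hard": 0.05, "mid1": 0.55, "mid2": 0.45}
picked = select_intermediate_prompts(
    list(fake_scores), fake_scores.get, target=0.5, k=2
)
```

Here the too-easy and too-hard prompts are discarded and the two near-50% prompts are kept; in the actual method the scores would come from value-model forward passes rather than a table.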
Efficient value-model-based prompt filtering approach

The authors develop an efficient prompt filtering method that uses a value model to estimate prompt difficulty with single forward passes, achieving 12.1× and 16.9× speedups compared to rollout-based filtering on MATH and DeepScaleR respectively.

10 retrieved papers
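As a rough intuition for where the speedup comes from (illustrative numbers, not the paper's measurements): rollout-based filtering pays one forward pass per generated token per rollout, while a value model pays a single forward pass per prompt. The toy ratio below counts raw forward passes and therefore overstates the gain; the measured 12.1× and 16.9× wall-clock speedups also depend on prompt length, batching, and hardware.

```python
# Back-of-the-envelope forward-pass count comparison between
# rollout-based filtering and value-model filtering. All numbers
# (rollouts per prompt, response length) are illustrative.

def rollout_cost(num_prompts: int, rollouts_per_prompt: int,
                 response_tokens: int) -> int:
    # Autoregressive generation: one forward pass per generated token.
    return num_prompts * rollouts_per_prompt * response_tokens

def value_model_cost(num_prompts: int) -> int:
    # One forward pass per prompt, no token-by-token generation.
    return num_prompts

n = 1024
speedup = rollout_cost(n, 8, 512) / value_model_cost(n)
print(f"illustrative forward-pass ratio: {speedup:.0f}x")
```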

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic investigation of batch size and prompt difficulty effects on RL convergence

The authors conduct large-scale experiments to identify an optimal batch size that balances generation time and gradient quality, and demonstrate that prompts of intermediate difficulty (approximately 50% success rate) are most sample-efficient for convergence in RL-based LLM post-training.

Contribution

Prompt Curriculum Learning (PCL) algorithm

The authors introduce PCL, a lightweight algorithm that uses a learned value model to select intermediate-difficulty prompts via single forward passes, avoiding expensive rollout-based filtering while efficiently focusing training on informative samples.

Contribution

Efficient value-model-based prompt filtering approach

The authors develop an efficient prompt filtering method that uses a value model to estimate prompt difficulty with single forward passes, achieving 12.1× and 16.9× speedups compared to rollout-based filtering on MATH and DeepScaleR respectively.
