Prompt Curriculum Learning for Efficient LLM Post-Training
Overview
Overall Novelty Assessment
The paper proposes Prompt Curriculum Learning (PCL), a method that selects intermediate-difficulty prompts using a learned value model to improve sample efficiency in RL post-training for LLMs. It resides in the 'Curriculum and Progressive Learning' leaf, which contains four papers total (including this one). This leaf sits within the broader 'Sample Efficiency and Data Optimization' branch, indicating a moderately populated research direction focused on data-centric efficiency strategies rather than algorithmic or system-level innovations.
The taxonomy reveals that curriculum-based methods occupy one of three sibling categories under sample efficiency, alongside 'Experience Replay and Off-Policy Methods' (four papers) and 'Data Pruning and Selection' (two papers). Neighboring branches address complementary efficiency concerns: 'System-Level Efficiency' tackles distributed training infrastructure, while 'Parameter-Efficient and Sparse Training' explores low-rank adapters and subnetwork approaches. PCL's focus on prompt difficulty sequencing distinguishes it from replay-based methods that reuse past trajectories and from static data selection techniques that filter without temporal ordering.
Among 23 candidates examined across three contributions, none clearly refute the paper's claims. The systematic investigation of batch size and prompt difficulty examined 10 candidates with zero refutations, suggesting limited prior work explicitly studying these convergence factors together. The PCL algorithm itself examined 3 candidates without refutation, and the value-model-based filtering approach examined 10 candidates, also without refutation. These statistics indicate that within the limited search scope, the specific combination of intermediate-difficulty prompt selection via learned value models appears relatively unexplored, though the broader curriculum learning paradigm is established.
Given the search examined 23 candidates from top-K semantic matches, the analysis captures closely related work but cannot claim exhaustive coverage. The absence of refutations suggests novelty within this scope, yet the moderately populated taxonomy leaf (four papers) indicates the curriculum learning direction is active. The paper's contribution appears to lie in operationalizing intermediate-difficulty selection through value models rather than rollouts, a practical refinement within an established paradigm rather than a fundamentally new research direction.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors conduct large-scale experiments to identify an optimal batch size that balances generation time and gradient quality, and demonstrate that prompts of intermediate difficulty (approximately 50% success rate) are most sample-efficient for convergence in RL-based LLM post-training.
The authors introduce PCL, a lightweight algorithm that uses a learned value model to select intermediate-difficulty prompts via single forward passes, avoiding expensive rollout-based filtering while efficiently focusing training on informative samples.
The authors develop an efficient prompt filtering method that uses a value model to estimate prompt difficulty with a single forward pass per prompt, achieving 12.1× and 16.9× speedups over rollout-based filtering on MATH and DeepScaleR, respectively.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[37] Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning
[46] Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
[50] From Data-Centric to Sample-Centric: Enhancing LLM Reasoning via Progressive Optimization
Contribution Analysis
Detailed comparisons for each claimed contribution
Systematic investigation of batch size and prompt difficulty effects on RL convergence
The authors conduct large-scale experiments to identify an optimal batch size that balances generation time and gradient quality, and demonstrate that prompts of intermediate difficulty (approximately 50% success rate) are most sample-efficient for convergence in RL-based LLM post-training.
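The intuition behind the ~50% success-rate finding can be illustrated with a simple calculation (illustrative only, not taken from the paper): for a binary reward, the per-prompt reward variance p(1 − p) peaks at p = 0.5, so intermediate-difficulty prompts yield the largest share of rollout groups with a nonzero advantage signal in group-based RL methods.

```python
# Illustrative sketch: the variance of a Bernoulli(p) reward, p*(1-p),
# is maximized at p = 0.5. Prompts the model solves ~50% of the time
# therefore carry the most gradient signal per rollout; prompts that are
# nearly always solved (or never solved) contribute almost none.
def reward_variance(p: float) -> float:
    """Variance of a binary (0/1) reward for a prompt with success rate p."""
    return p * (1.0 - p)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"success rate {p:.1f} -> reward variance {reward_variance(p):.2f}")
```

The symmetry around 0.5 also explains why filtering from both ends (too easy and too hard) helps, rather than only discarding unsolvable prompts.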
[51] DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective
[52] ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment
[53] Scaling DRL for Decision Making: A Survey on Data, Network, and Training Budget Strategies
[54] CoDaPo: Confidence- and Difficulty-Adaptive Policy Optimization for Post-Training Language Models
[55] RL for Consistency Models: Reward-Guided Text-to-Image Generation with Fast Inference
[56] Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size
[57] Automatically Identifying Errors in Primary Level Math Word Problems Generated by Large Language Models: A Research Report Submitted to School of Mathematical and …
[58] Compositional Collapse Resistance via Spectrum-Fixed Probabilistic Lattice Injection in Large Language Model Architectures
[59] Adaptive Test-Time Compute for RL Fine-Tuning
[60] Supervised Fine-Tuning and Curriculum-Guided Direct Preference Optimization on Qwen2.5-0.5B
Prompt Curriculum Learning (PCL) algorithm
The authors introduce PCL, a lightweight algorithm that uses a learned value model to select intermediate-difficulty prompts via single forward passes, avoiding expensive rollout-based filtering while efficiently focusing training on informative samples.
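The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the value model has already produced one predicted success probability per candidate prompt (one forward pass each, no rollouts), and the function name and target band are hypothetical.

```python
def select_intermediate_prompts(predicted_success, k, target=0.5):
    """Keep the k prompt indices whose value-model-predicted success
    rate is closest to `target` (~50%, the intermediate-difficulty band).

    `predicted_success` holds one scalar per candidate prompt, assumed to
    come from a single value-model forward pass per prompt (no rollouts).
    """
    # Rank candidate prompts by distance from the intermediate target.
    ranked = sorted(range(len(predicted_success)),
                    key=lambda i: abs(predicted_success[i] - target))
    return ranked[:k]

# Hypothetical value-model scores for six candidate prompts:
scores = [0.02, 0.48, 0.97, 0.55, 0.10, 0.90]
print(select_intermediate_prompts(scores, k=2))
```

Because scoring is a single forward pass per prompt, the batch of candidates can be much larger than the training batch, which is what makes on-the-fly curriculum selection cheap.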
Efficient value-model-based prompt filtering approach
The authors develop an efficient prompt filtering method that uses a value model to estimate prompt difficulty with a single forward pass per prompt, achieving 12.1× and 16.9× speedups over rollout-based filtering on MATH and DeepScaleR, respectively.
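The source of the speedup is that rollout-based filtering must generate several full completions per prompt, while value-model filtering pays only one forward pass over the prompt. A back-of-envelope cost model makes this concrete; all numbers below are illustrative assumptions, not figures from the paper.

```python
def filtering_speedup(num_rollouts: int, tokens_per_rollout: int,
                      prompt_tokens: int) -> float:
    """Rough ratio of rollout-based to value-model filtering cost,
    counting forward-pass token work only. Generation additionally pays
    a sequential decoding cost per token, which is not modeled here and
    makes the real gap larger."""
    rollout_cost = num_rollouts * (prompt_tokens + tokens_per_rollout)
    value_model_cost = prompt_tokens  # one forward pass over the prompt
    return rollout_cost / value_model_cost

# e.g., 8 rollouts of 1024 generated tokens vs. one pass over a
# 512-token prompt (all numbers illustrative):
print(f"{filtering_speedup(8, 1024, 512):.1f}x")
```

Even this token-count-only model lands in the same order of magnitude as the reported 12.1×/16.9× gains; the sequential nature of autoregressive decoding pushes the practical gap higher.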