Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: training re-evaluation curve, data curriculum / data placement, large language model (LLM) pre-training, AdamW EMA timescale, learning-rate schedules, tokens-per-parameter ratio
Abstract:

Data curriculums have become central to successful LLM training, yet the principles governing optimal data placement remain unclear. We introduce the training re-evaluation curve (TREC), a diagnostic that retrospectively evaluates training batches using the final model weights. The TREC characterizes how well a trained model retains training data as a function of when the data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, while a TREC is directly observable only after training, we demonstrate that it can be predicted in advance from AdamW's implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. We also align high-quality data with TREC minima to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the training re-evaluation curve (TREC) as a diagnostic for retrospectively assessing how well a trained model retains data encountered at different training stages, then demonstrates that TRECs can be predicted from AdamW's implicit EMA coefficients to enable proactive curriculum design. Within the taxonomy, it occupies the 'Predictive and Diagnostic Curriculum Design' leaf under 'Curriculum Construction Methods', where it is currently the sole member. This places it in a sparse, emerging research direction distinct from the more populated difficulty-based and adaptive curriculum branches.

The taxonomy reveals that most curriculum work clusters in neighboring leaves: 'Difficulty-Based Curriculum Construction' contains four papers using static metrics like perplexity or length, while 'Adaptive and Dynamic Curriculum Strategies' houses four papers that adjust curricula during training based on observed performance. The paper's predictive approach contrasts with these reactive or heuristic methods by forecasting retention patterns before training completes. Its connection to optimizer internals (AdamW EMA coefficients) also distinguishes it from data-centric methods in 'Data Efficacy and Organization Paradigms' and 'Data Selection and Filtering', which focus on sample quality or structural arrangement rather than training dynamics.

Among twenty-three candidates examined across three contributions, none were flagged as clearly refuting the work. The TREC diagnostic itself was assessed against ten candidates with zero refutations, as was the predictive model linking TRECs to AdamW coefficients. The large-scale empirical study connecting TRECs to optimizer timescales examined three candidates, also with no refutations. Given the limited search scope—top-K semantic matches plus citation expansion—these statistics suggest the specific combination of retrospective evaluation curves and optimizer-based prediction has not been extensively explored in prior curriculum literature, though the analysis does not claim exhaustive coverage.

Based on the examined candidates and taxonomy structure, the work appears to occupy a relatively novel position by bridging optimizer theory and curriculum design through predictive diagnostics. The absence of sibling papers in its taxonomy leaf and the zero-refutation outcome across all contributions indicate limited direct prior work within the search scope. However, the analysis is constrained by the twenty-three-candidate sample and does not rule out related ideas in broader optimization or training dynamics literature outside the curriculum-focused search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: data curriculum design for large language model training. The field organizes around several major branches that reflect different stages and perspectives on how to sequence training data. Curriculum Learning Frameworks and Theoretical Foundations establish the conceptual underpinnings, while Curriculum Construction Methods focus on practical strategies for ordering examples, ranging from difficulty-based schedules like Length-Based Curriculum[8] to more sophisticated predictive approaches. Data Selection and Filtering addresses which samples to include or exclude, and Reinforcement Learning-Based Curriculum Approaches leverage RL signals to guide data ordering. Post-Training and Alignment Curricula deal with fine-tuning and preference learning stages, Application Domains and Task-Specific Curricula tailor strategies to particular use cases, Training Infrastructure and Optimization handles computational considerations, and Evaluation, Analysis, and Auxiliary Topics provide measurement and supporting techniques. Together, these branches capture the lifecycle from initial data preparation through deployment-ready models.

A particularly active line of work explores predictive and diagnostic methods that anticipate training dynamics before committing full resources. Predicting Training Curves[0] exemplifies this direction by forecasting model behavior to inform curriculum decisions, contrasting with more heuristic orderings such as Curriculum Language Modeling[3] or Strategic Data Ordering[5], which rely on predefined difficulty metrics or domain knowledge. Another vibrant area involves adaptive curricula that evolve during training, as seen in Self-Evolving Curriculum[36] and AdaCuRL[35], which dynamically adjust data sequences based on ongoing performance.

The original paper sits within the predictive and diagnostic cluster, emphasizing forward-looking analysis to optimize data scheduling. Compared to Strategic Data Ordering[5], which applies fixed rules, or Curriculum Language Modeling[3], which uses static difficulty proxies, Predicting Training Curves[0] offers a more anticipatory stance, aiming to reduce trial-and-error by modeling training trajectories in advance.

Claimed Contributions

Training re-evaluation curve (TREC) diagnostic

The authors introduce TREC, a diagnostic tool that measures how well a fully trained model performs on training batches as a function of when they appeared during training. They demonstrate that placing high-quality data at TREC minima significantly improves model performance.

10 retrieved papers
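The TREC construction described above is straightforward to sketch. The toy below stands in for LLM pre-training with a linear model trained by SGD on a stream of batches; every name here (`batches`, `trec`) and the model choice are illustrative assumptions, not the paper's actual setup. The key point is that the curve is built by re-evaluating every training batch with the *final* weights, indexed by when the batch was seen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for pre-training: a linear model fit by SGD on a
# stream of batches (hypothetical setup, not the paper's).
d, n_batches, batch_size = 8, 200, 32
w_true = rng.normal(size=d)
batches = []
for _ in range(n_batches):
    X = rng.normal(size=(batch_size, d))
    y = X @ w_true + 0.1 * rng.normal(size=batch_size)
    batches.append((X, y))

# Train once through the stream, in order.
w = np.zeros(d)
lr = 0.05
for X, y in batches:
    grad = 2 * X.T @ (X @ w - y) / batch_size
    w -= lr * grad

# TREC: retrospectively evaluate each training batch with the
# final weights, as a function of its position in training.
trec = [float(np.mean((X @ w - y) ** 2)) for X, y in batches]
```

In a real run, the per-batch loss would come from the trained LLM's forward pass; the curve's low points then mark training positions whose data the final model retains poorly.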
Predictive model for TRECs based on AdamW EMA coefficients

The authors show that TRECs can be predicted before training completes by using AdamW's exponential moving average (EMA) coefficients combined with a training-fraction adjustment term. This enables practitioners to design data curriculums proactively without costly trial-and-error.

10 retrieved papers
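This claim rests on a known reading of decoupled weight decay: the final AdamW iterate is approximately an EMA of per-step updates, where step t's update survives with coefficient c_t = lr_t · ∏_{s>t}(1 − lr_s · wd). The sketch below computes those coefficients for a cosine learning-rate schedule; the function name and hyperparameters are ours, and the paper's training-fraction adjustment term is not reproduced here.

```python
import numpy as np

def ema_coefficients(lrs, weight_decay):
    """Implicit EMA coefficient of each step's update in the final
    AdamW iterate: c_t = lr_t * prod_{s > t} (1 - lr_s * wd).
    Steps with large c_t should be retained best; TREC minima are
    predicted where c_t is small."""
    lrs = np.asarray(lrs, dtype=float)
    decay = 1.0 - lrs * weight_decay
    # tail[t] = product of decay factors applied *after* step t.
    tail = np.concatenate([np.cumprod(decay[::-1])[::-1][1:], [1.0]])
    return lrs * tail

# Example: cosine learning-rate schedule, common in LLM recipes.
T = 1000
lrs = 0.5 * 3e-4 * (1 + np.cos(np.pi * np.arange(T) / T))
coef = ema_coefficients(lrs, weight_decay=0.1)
```

On this view, a predicted TREC is read off the coefficient profile before any training happens, which is what makes proactive data placement possible.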
Large-scale empirical study connecting TRECs to AdamW timescale

The authors conduct a comprehensive study analyzing 600 TRECs across models ranging from 111M to 3.9B parameters and various dataset sizes. They establish that TREC shape is predominantly governed by the AdamW timescale, with tokens-per-parameter playing a secondary role.

3 retrieved papers
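The AdamW timescale referred to here is commonly approximated as 1/(lr × weight_decay) steps, the horizon of the optimizer's implicit EMA. A hedged helper (the function name, `tokens_per_step`, and the example hyperparameters are our assumptions) converts that horizon to tokens:

```python
def adamw_ema_timescale(lr, weight_decay, tokens_per_step):
    """Approximate AdamW forgetting timescale.

    With decoupled weight decay, weights are roughly an EMA of updates
    with per-step decay (1 - lr * wd), i.e. a timescale of about
    1 / (lr * wd) steps. `tokens_per_step` (batch size in tokens)
    converts steps to tokens.
    """
    steps = 1.0 / (lr * weight_decay)
    return steps, steps * tokens_per_step

# Example: lr 3e-4, weight decay 0.1, 4M-token batches.
steps, tokens = adamw_ema_timescale(3e-4, 0.1, 4_000_000)
```

With these illustrative numbers the timescale is roughly 33k steps, or on the order of 10^11 tokens; on this reading, data seen much more than one timescale before the end of training tends to be retained less well, which is consistent with the study's finding that the timescale dominates TREC shape.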

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Training re-evaluation curve (TREC) diagnostic

The authors introduce TREC, a diagnostic tool that measures how well a fully trained model performs on training batches as a function of when they appeared during training. They demonstrate that placing high-quality data at TREC minima significantly improves model performance.

Contribution

Predictive model for TRECs based on AdamW EMA coefficients

The authors show that TRECs can be predicted before training completes by using AdamW's exponential moving average (EMA) coefficients combined with a training-fraction adjustment term. This enables practitioners to design data curriculums proactively without costly trial-and-error.

Contribution

Large-scale empirical study connecting TRECs to AdamW timescale

The authors conduct a comprehensive study analyzing 600 TRECs across models ranging from 111M to 3.9B parameters and various dataset sizes. They establish that TREC shape is predominantly governed by the AdamW timescale, with tokens-per-parameter playing a secondary role.