Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: training re-evaluation curve, data curriculum / data placement, large language model (LLM) pre-training, AdamW EMA timescale, learning-rate schedules, tokens-per-parameter ratio
Abstract:

Data curriculums have become central to successful LLM training, yet the principles governing optimal data placement remain unclear. We introduce the training re-evaluation curve (TREC), a diagnostic that retrospectively evaluates training batches using the final model weights. The TREC characterizes how well a trained model retains training data as a function of when the data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, while a TREC is directly observable only after training, we demonstrate that it can be predicted in advance from AdamW's implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. We also align high-quality data with TREC minima to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the training re-evaluation curve (TREC) as a diagnostic for retrospectively assessing how well a trained model retains data encountered at different training stages, then demonstrates that TRECs can be predicted from AdamW's implicit EMA coefficients to enable proactive curriculum design. Within the taxonomy, it occupies the 'Predictive and Diagnostic Curriculum Design' leaf under 'Curriculum Construction Methods', where it is currently the sole member. This places it in a sparse, emerging research direction distinct from the more populated difficulty-based and adaptive curriculum branches.

The taxonomy reveals that most curriculum work clusters in neighboring leaves: 'Difficulty-Based Curriculum Construction' contains four papers using static metrics like perplexity or length, while 'Adaptive and Dynamic Curriculum Strategies' houses four papers that adjust curricula during training based on observed performance. The paper's predictive approach contrasts with these reactive or heuristic methods by forecasting retention patterns before training completes. Its connection to optimizer internals (AdamW EMA coefficients) also distinguishes it from data-centric methods in 'Data Efficacy and Organization Paradigms' and 'Data Selection and Filtering', which focus on sample quality or structural arrangement rather than training dynamics.

Among twenty-three candidates examined across three contributions, none were flagged as clearly refuting the work. The TREC diagnostic itself was assessed against ten candidates with zero refutations, as was the predictive model linking TRECs to AdamW coefficients. The large-scale empirical study connecting TRECs to optimizer timescales examined three candidates, also with no refutations. Given the limited search scope—top-K semantic matches plus citation expansion—these statistics suggest the specific combination of retrospective evaluation curves and optimizer-based prediction has not been extensively explored in prior curriculum literature, though the analysis does not claim exhaustive coverage.

Based on the examined candidates and taxonomy structure, the work appears to occupy a relatively novel position by bridging optimizer theory and curriculum design through predictive diagnostics. The absence of sibling papers in its taxonomy leaf and the zero-refutation outcome across all contributions indicate limited direct prior work within the search scope. However, the analysis is constrained by the twenty-three-candidate sample and does not rule out related ideas in broader optimization or training dynamics literature outside the curriculum-focused search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: data curriculum design for large language model training. The field organizes around several major branches that reflect different stages and perspectives on how to sequence training data. Curriculum Learning Frameworks and Theoretical Foundations establish the conceptual underpinnings, while Curriculum Construction Methods focus on practical strategies for ordering examples, ranging from difficulty-based schedules like Length-Based Curriculum[8] to more sophisticated predictive approaches. Data Selection and Filtering addresses which samples to include or exclude, and Reinforcement Learning-Based Curriculum Approaches leverage RL signals to guide data ordering. Post-Training and Alignment Curricula deal with fine-tuning and preference learning stages, Application Domains and Task-Specific Curricula tailor strategies to particular use cases, Training Infrastructure and Optimization handles computational considerations, and Evaluation, Analysis, and Auxiliary Topics provide measurement and supporting techniques. Together, these branches capture the lifecycle from initial data preparation through deployment-ready models.

A particularly active line of work explores predictive and diagnostic methods that anticipate training dynamics before committing full resources. Predicting Training Curves[0] exemplifies this direction by forecasting model behavior to inform curriculum decisions, contrasting with more heuristic orderings such as Curriculum Language Modeling[3] or Strategic Data Ordering[5], which rely on predefined difficulty metrics or domain knowledge. Another vibrant area involves adaptive curricula that evolve during training, as seen in Self-Evolving Curriculum[36] and AdaCuRL[35], which dynamically adjust data sequences based on ongoing performance.

The original paper sits within the predictive and diagnostic cluster, emphasizing forward-looking analysis to optimize data scheduling. Compared to Strategic Data Ordering[5], which applies fixed rules, or Curriculum Language Modeling[3], which uses static difficulty proxies, Predicting Training Curves[0] offers a more anticipatory stance, aiming to reduce trial-and-error by modeling training trajectories in advance.

Claimed Contributions

Training re-evaluation curve (TREC) diagnostic

The authors introduce TREC, a diagnostic tool that measures how well a fully trained model performs on training batches as a function of when they appeared during training. They demonstrate that placing high-quality data at TREC minima significantly improves model performance.

10 retrieved papers
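The TREC construction described above is straightforward to sketch. The toy below stands in for LLM pre-training with a linear model trained by SGD on a stream of batches; every name here (`batches`, `trec`) and the model choice are illustrative assumptions, not the paper's actual setup. The key point is that the curve is built by re-evaluating every training batch with the *final* weights, indexed by when the batch was seen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for pre-training: a linear model fit by SGD on a
# stream of batches (hypothetical setup, not the paper's).
d, n_batches, batch_size = 8, 200, 32
w_true = rng.normal(size=d)
batches = []
for _ in range(n_batches):
    X = rng.normal(size=(batch_size, d))
    y = X @ w_true + 0.1 * rng.normal(size=batch_size)
    batches.append((X, y))

# Train once through the stream, in order.
w = np.zeros(d)
lr = 0.05
for X, y in batches:
    grad = 2 * X.T @ (X @ w - y) / batch_size
    w -= lr * grad

# TREC: retrospectively evaluate each training batch with the
# final weights, as a function of its position in training.
trec = [float(np.mean((X @ w - y) ** 2)) for X, y in batches]
```

In a real run, the per-batch loss would come from the trained LLM's forward pass; the curve's low points then mark training positions whose data the final model retains poorly.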
Predictive model for TRECs based on AdamW EMA coefficients

The authors show that TRECs can be predicted before training completes by using AdamW's exponential moving average (EMA) coefficients combined with a training-fraction adjustment term. This enables practitioners to design data curriculums proactively without costly trial-and-error.

10 retrieved papers
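This claim rests on a known reading of decoupled weight decay: the final AdamW iterate is approximately an EMA of per-step updates, where step t's update survives with coefficient c_t = lr_t · ∏_{s>t}(1 − lr_s · wd). The sketch below computes those coefficients for a cosine learning-rate schedule; the function name and hyperparameters are ours, and the paper's training-fraction adjustment term is not reproduced here.

```python
import numpy as np

def ema_coefficients(lrs, weight_decay):
    """Implicit EMA coefficient of each step's update in the final
    AdamW iterate: c_t = lr_t * prod_{s > t} (1 - lr_s * wd).
    Steps with large c_t should be retained best; TREC minima are
    predicted where c_t is small."""
    lrs = np.asarray(lrs, dtype=float)
    decay = 1.0 - lrs * weight_decay
    # tail[t] = product of decay factors applied *after* step t.
    tail = np.concatenate([np.cumprod(decay[::-1])[::-1][1:], [1.0]])
    return lrs * tail

# Example: cosine learning-rate schedule, common in LLM recipes.
T = 1000
lrs = 0.5 * 3e-4 * (1 + np.cos(np.pi * np.arange(T) / T))
coef = ema_coefficients(lrs, weight_decay=0.1)
```

On this view, a predicted TREC is read off the coefficient profile before any training happens, which is what makes proactive data placement possible.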
Large-scale empirical study connecting TRECs to AdamW timescale

The authors conduct a comprehensive study analyzing 600 TRECs across models ranging from 111M to 3.9B parameters and various dataset sizes. They establish that TREC shape is predominantly governed by the AdamW timescale, with tokens-per-parameter playing a secondary role.

3 retrieved papers
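The AdamW timescale referred to here is commonly approximated as 1/(lr × weight_decay) steps, the horizon of the optimizer's implicit EMA. A hedged helper (the function name, `tokens_per_step`, and the example hyperparameters are our assumptions) converts that horizon to tokens:

```python
def adamw_ema_timescale(lr, weight_decay, tokens_per_step):
    """Approximate AdamW forgetting timescale.

    With decoupled weight decay, weights are roughly an EMA of updates
    with per-step decay (1 - lr * wd), i.e. a timescale of about
    1 / (lr * wd) steps. `tokens_per_step` (batch size in tokens)
    converts steps to tokens.
    """
    steps = 1.0 / (lr * weight_decay)
    return steps, steps * tokens_per_step

# Example: lr 3e-4, weight decay 0.1, 4M-token batches.
steps, tokens = adamw_ema_timescale(3e-4, 0.1, 4_000_000)
```

With these illustrative numbers the timescale is roughly 33k steps, or on the order of 10^11 tokens; on this reading, data seen much more than one timescale before the end of training tends to be retained less well, which is consistent with the study's finding that the timescale dominates TREC shape.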

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Training re-evaluation curve (TREC) diagnostic

The authors introduce TREC, a diagnostic tool that measures how well a fully trained model performs on training batches as a function of when they appeared during training. They demonstrate that placing high-quality data at TREC minima significantly improves model performance.

Contribution

Predictive model for TRECs based on AdamW EMA coefficients

The authors show that TRECs can be predicted before training completes by using AdamW's exponential moving average (EMA) coefficients combined with a training-fraction adjustment term. This enables practitioners to design data curriculums proactively without costly trial-and-error.

Contribution

Large-scale empirical study connecting TRECs to AdamW timescale

The authors conduct a comprehensive study analyzing 600 TRECs across models ranging from 111M to 3.9B parameters and various dataset sizes. They establish that TREC shape is predominantly governed by the AdamW timescale, with tokens-per-parameter playing a secondary role.