Pretraining Scaling Laws for Generative Evaluations of Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: language models, large language models, scaling laws, evaluations, generative evaluations, sampling
Abstract:

Neural scaling laws have driven the field's exponential growth in parameters, data, and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, generative benchmarks such as mathematical problem-solving or software engineering remain under-explored. We propose and evaluate three pretraining scaling laws for fitting pass-at-k on generative evaluations and for predicting the pass-at-k of the most expensive model from cheaper models. The three laws differ in their covariates: (1) pretraining compute, (2) model parameters and pretraining tokens, and (3) log likelihoods of gold reference solutions. First, we demonstrate that generative evaluations introduce new hyperparameters (in our setting, k) that act as a control lever for scaling behavior, modulating both the scaling law parameters and the predictability of performance. Second, we identify a stark difference in parameter stability: while the compute and parameters-and-tokens laws stabilize only over the last ~1.5-2.5 orders of magnitude, the gold-reference-likelihood law is uniquely stable, converging across ~5 orders of magnitude. Third, in terms of predictive performance, all three scaling laws perform comparably, although the compute law predicts slightly worse for small k and the gold-reference law slightly worse for large k. Finally, we establish a theoretical connection, proving that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens law. Our framework provides researchers and practitioners with insights and methodologies for forecasting generative performance, accelerating progress toward models that can reason, solve, and create.
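For orientation, the display below sketches the general shapes such pass-at-k laws typically take in the scaling-law literature (saturating power laws, plus a monotone link on likelihoods). This is an illustrative sketch only; the submission's exact parameterizations are not reproduced in this report.

```latex
% Illustrative shapes only; \pi_k, a_k, \alpha_k, b_k, \beta_k, g_k are placeholders,
% not the submission's fitted parameterizations.
\widehat{\mathrm{pass@}k}(C)    \;=\; \pi_k - a_k\, C^{-\alpha_k}                        % (1) pretraining compute
\widehat{\mathrm{pass@}k}(N, D) \;=\; \pi_k - a_k\, N^{-\alpha_k} - b_k\, D^{-\beta_k}   % (2) parameters and tokens
\widehat{\mathrm{pass@}k}(\ell) \;=\; g_k(\ell)                                          % (3) mean log likelihood \ell of gold references, with monotone link g_k
```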

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes three scaling laws for predicting pass-at-k performance on generative evaluations, using compute, parameters-and-tokens, or gold reference likelihoods as covariates. It resides in the 'Generative Task Scaling Laws' leaf, which contains only three papers total, including this one. This is a notably sparse research direction within the broader taxonomy of 31 papers across the field, suggesting that scaling laws specifically targeting generative benchmarks remain under-explored compared to discriminative tasks or pretraining loss prediction.

The taxonomy reveals neighboring leaves focused on discriminative task scaling, pretraining loss prediction, and observational scaling methodologies. The paper's emphasis on generative evaluations (mathematical problem-solving, code generation) distinguishes it from sibling work on discriminative transfer learning and from pretraining-only loss curves. By examining pass-at-k as a metric with its own hyperparameter k, the work diverges from traditional scaling studies that focus on single-point metrics. The scope note for this leaf explicitly excludes discriminative tasks and pretraining loss, positioning this contribution at the intersection of scaling theory and open-ended generation quality.

Among 22 candidates examined across three contributions, only one refutable pair emerged. The first contribution (three scaling laws for pass-at-k) examined 10 candidates with zero refutations, suggesting limited direct prior work on this specific formulation. The second contribution (k as a control lever) examined 2 candidates, also with no refutations. The third contribution (theoretical derivation connecting compute to parameters-and-tokens) examined 10 candidates and found 1 refutable match, indicating some overlap with existing scaling law theory. Given the limited search scope of 22 papers, these statistics suggest moderate novelty for the first two contributions and more substantial prior work for the theoretical derivation.

Based on top-22 semantic matches, the work appears to occupy a relatively open niche within generative task scaling. The sparse leaf structure and low refutation counts suggest that predicting pass-at-k from pretraining curves is less saturated than foundational scaling law research. However, the analysis does not cover exhaustive citation networks or domain-specific generative benchmarks outside the examined candidates, leaving open the possibility of additional relevant prior work in specialized venues or recent preprints.

Taxonomy

Core-task Taxonomy Papers: 31
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Paper: 1

Research Landscape Overview

Core task: predicting generative evaluation performance from pretraining scaling laws.

The field organizes around several complementary perspectives. At the broadest level, researchers study scaling laws for generative model performance prediction, examining how metrics evolve with compute and data. Modality-specific branches investigate scaling in vision, language, and multimodal settings, while domain-specific work targets applications such as recommendation systems, medical report generation, and drug discovery. Parallel efforts focus on pretraining data efficiency and architectural choices, alongside methodological advances in evaluation metrics and benchmarking practices. Together, these branches reflect a maturing understanding of how pretraining investments translate into downstream generative capabilities, with works like Pretraining from Pixels[3] and Autoregressive Modeling Scaling[30] illustrating early foundations, and more recent studies such as Predictability and Surprise[5] and Downstream Task Scaling[4] refining predictive frameworks.

Within the scaling laws literature, a central tension emerges between predictable trends and emergent surprises: some capabilities scale smoothly while others appear discontinuously. Generative Evaluations Scaling[0] sits squarely in this active area, focusing on whether generative task performance can be reliably forecast from pretraining curves. It shares thematic ground with Predictability and Surprise[5], which explores when scaling laws hold versus when they break down, and with Downstream Task Scaling[4], which examines transfer to specific evaluation benchmarks. Compared to Autoregressive Modeling Scaling[30], which established foundational scaling relationships for next-token prediction, Generative Evaluations Scaling[0] emphasizes the generative evaluation regime, where open-ended generation quality must be assessed. This positioning highlights ongoing questions about extrapolation limits, the role of task diversity, and whether pretraining metrics suffice to anticipate complex generative behaviors.

Claimed Contributions

Three pretraining scaling laws for pass-at-k on generative evaluations

The authors introduce three distinct scaling laws that predict pass-at-k performance on generative benchmarks using different covariates: (1) pretraining compute, (2) model parameters and pretraining tokens, and (3) log likelihoods of gold reference solutions. Each law is fit to measured pass-at-k values and backtested by forecasting the performance of the most expensive model from cheaper ones.
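As a concrete illustration of this fit-and-backtest protocol, the sketch below fits a saturating power law in pretraining compute to pass-at-k measurements from cheaper models and extrapolates to the most expensive one. The functional form, the synthetic data, and all variable names are assumptions made for illustration; this is not the submission's code or parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

def compute_law(log10_C, ceiling, a, alpha):
    """Assumed saturating power law in pretraining compute C:
    pass@k = ceiling - a * C**(-alpha), written in log10(C) for numerical conditioning."""
    return ceiling - a * np.power(10.0, -alpha * log10_C)

# Hypothetical models spanning three orders of magnitude of pretraining compute (FLOPs).
log10_compute = np.array([19.0, 19.5, 20.0, 20.5, 21.0, 21.5, 22.0])

# Synthetic pass@k values drawn from an assumed "true" law plus noise (illustration only).
rng = np.random.default_rng(0)
true_params = (0.85, 110.0, 0.115)
pass_at_k = compute_law(log10_compute, *true_params) + rng.normal(0.0, 0.01, log10_compute.size)

# Backtest: fit on every model except the most compute-expensive one, then predict it.
params, _ = curve_fit(
    compute_law, log10_compute[:-1], pass_at_k[:-1],
    p0=(0.9, 100.0, 0.1),                            # initial guess: ceiling, scale, exponent
    bounds=([0.0, 0.0, 0.0], [1.0, np.inf, 1.0]),
)
prediction = compute_law(log10_compute[-1], *params)
print("fitted (ceiling, a, alpha):", np.round(params, 3))
print(f"predicted pass@k at 1e{log10_compute[-1]:.0f} FLOPs: {prediction:.3f} "
      f"(held-out value: {pass_at_k[-1]:.3f})")
```

The same fit-on-cheap, predict-most-expensive protocol applies unchanged when the covariates are parameters and tokens, or gold-reference log likelihoods.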

10 retrieved papers

Identification of k as a control lever for scaling behavior

The authors demonstrate that the number of attempts per problem (k) in generative evaluations acts as a critical hyperparameter that modulates both the scaling law parameters and the predictability of performance, a feature not available in pretraining loss or discriminative evaluation scaling laws.
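To make concrete how k enters the evaluated quantity, the sketch below uses the standard unbiased pass-at-k estimator from the code-generation evaluation literature, pass@k = 1 - C(n-c, k)/C(n, k) for n sampled solutions of which c are correct, computed in a numerically stable product form. Whether the submission uses exactly this estimator is an assumption, and the sweep over k is purely illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator given n sampled solutions with c correct:
    1 - C(n-c, k) / C(n, k), evaluated as a stable running product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical counts for one problem: 200 samples, 14 of them correct.
n_samples, n_correct = 200, 14
for k in (1, 4, 16, 64):
    print(f"pass@{k:>2} = {pass_at_k(n_samples, n_correct, k):.3f}")
```

Averaged over a benchmark, increasing k moves pass-at-k from the mean per-sample success rate (k = 1) toward the fraction of problems solved at least once within the sampling budget, which is the sense in which k acts as a lever on both the fitted scaling-law parameters and the predictability of the resulting curve.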

2 retrieved papers

Theoretical derivation connecting compute law to parameters-and-tokens law

The authors prove that the compute-only scaling law is the compute-optimal envelope of the parameters-and-tokens scaling law, obtained by minimizing the objective under a fixed compute budget. From this, they derive a dimensionless misallocation penalty that quantitatively explains when and why pretraining recipes underperform compute-optimal scaling.
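As a sketch of the kind of reduction involved, assume an additive power-law parameters-and-tokens form and the usual C ≈ 6ND pretraining cost model; both are assumptions standard in the scaling-law literature, and the submission's exact objective (fit to pass-at-k rather than loss) may differ in detail.

```latex
% Assumed parameters-and-tokens law and cost model (illustrative, not the submission's exact form):
f(N, D) = E + A\,N^{-\alpha} + B\,D^{-\beta}, \qquad C \approx 6\,N D .
% Eliminate D = C/(6N) and set \partial f / \partial N = 0 at fixed C:
N^{*}(C) \propto C^{\beta/(\alpha+\beta)}, \qquad D^{*}(C) \propto C^{\alpha/(\alpha+\beta)} .
% Substituting back gives the compute-optimal envelope, a pure power law in C:
f^{*}(C) = E + A'\,C^{-\alpha\beta/(\alpha+\beta)} .
% A recipe with N \neq N^{*}(C) multiplies the reducible term by a penalty that depends only
% on the dimensionless ratio N / N^{*}(C), i.e., on how far the allocation sits from optimal.
```

Under these assumptions, the compute-only law is exactly the envelope traced out by compute-optimal choices of N and D, and the misallocation penalty is dimensionless because it depends on N only through its ratio to the compute-optimal value.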

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Three pretraining scaling laws for pass-at-k on generative evaluations

Contribution

Identification of k as a control lever for scaling behavior

Contribution

Theoretical derivation connecting compute law to parameters-and-tokens law