Pretraining Scaling Laws for Generative Evaluations of Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: language models, large language models, scaling laws, evaluations, generative evaluations, sampling
Abstract:

Neural scaling laws have driven the field's exponential growth in parameters, data, and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, generative benchmarks such as mathematical problem-solving or software engineering remain under-explored. We propose and evaluate three pretraining scaling laws for fitting pass-at-k on generative evaluations and for predicting the pass-at-k of the most expensive model from cheaper models. The three laws differ in their covariates: (1) pretraining compute, (2) model parameters and pretraining tokens, and (3) log likelihoods of gold reference solutions. First, we demonstrate that generative evaluations introduce new hyperparameters (in our setting, k) that act as a control lever for scaling behavior, modulating both the scaling law parameters and the predictability of performance. Second, we identify a stark difference in parameter stability: while the compute and parameters-and-tokens laws stabilize only over the last ~1.5-2.5 orders of magnitude, the gold-reference-likelihood law is uniquely stable, converging across ~5 orders of magnitude. Third, in terms of predictive performance, all three scaling laws perform comparably, although the compute law predicts slightly worse for small k and the gold-reference law slightly worse for large k. Finally, we establish a theoretical connection, proving that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens law. Our framework provides researchers and practitioners with insights and methodologies for forecasting generative performance, accelerating progress toward models that can reason, solve, and create.
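For orientation, the display below sketches the general shapes such pass-at-k laws typically take in the scaling-law literature (saturating power laws, plus a monotone link on likelihoods). This is an illustrative sketch only; the submission's exact parameterizations are not reproduced in this report.

```latex
% Illustrative shapes only; \pi_k, a_k, \alpha_k, b_k, \beta_k, g_k are placeholders,
% not the submission's fitted parameterizations.
\widehat{\mathrm{pass@}k}(C)    \;=\; \pi_k - a_k\, C^{-\alpha_k}                        % (1) pretraining compute
\widehat{\mathrm{pass@}k}(N, D) \;=\; \pi_k - a_k\, N^{-\alpha_k} - b_k\, D^{-\beta_k}   % (2) parameters and tokens
\widehat{\mathrm{pass@}k}(\ell) \;=\; g_k(\ell)                                          % (3) mean log likelihood \ell of gold references, with monotone link g_k
```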

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes three scaling laws for predicting pass-at-k performance on generative evaluations, using compute, parameters-and-tokens, or gold reference likelihoods as covariates. It resides in the 'Generative Task Scaling Laws' leaf, which contains only three papers total, including this one. This is a notably sparse research direction within the broader taxonomy of 31 papers across the field, suggesting that scaling laws specifically targeting generative benchmarks remain under-explored compared to discriminative tasks or pretraining loss prediction.

The taxonomy reveals neighboring leaves focused on discriminative task scaling, pretraining loss prediction, and observational scaling methodologies. The paper's emphasis on generative evaluations (mathematical problem-solving, code generation) distinguishes it from sibling work on discriminative transfer learning and from pretraining-only loss curves. By examining pass-at-k as a metric with its own hyperparameter k, the work diverges from traditional scaling studies that focus on single-point metrics. The scope note for this leaf explicitly excludes discriminative tasks and pretraining loss, positioning this contribution at the intersection of scaling theory and open-ended generation quality.

Among 22 candidates examined across three contributions, only one refutable pair emerged. The first contribution (three scaling laws for pass-at-k) examined 10 candidates with zero refutations, suggesting limited direct prior work on this specific formulation. The second contribution (k as a control lever) examined 2 candidates, also with no refutations. The third contribution (theoretical derivation connecting compute to parameters-and-tokens) examined 10 candidates and found 1 refutable match, indicating some overlap with existing scaling law theory. Given the limited search scope of 22 papers, these statistics suggest moderate novelty for the first two contributions and more substantial prior work for the theoretical derivation.

Based on top-22 semantic matches, the work appears to occupy a relatively open niche within generative task scaling. The sparse leaf structure and low refutation counts suggest that predicting pass-at-k from pretraining curves is less saturated than foundational scaling law research. However, the analysis does not cover exhaustive citation networks or domain-specific generative benchmarks outside the examined candidates, leaving open the possibility of additional relevant prior work in specialized venues or recent preprints.

Taxonomy

Core-task Taxonomy Papers: 31
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Paper: 1

Research Landscape Overview

Core task: predicting generative evaluation performance from pretraining scaling laws.

The field organizes around several complementary perspectives. At the broadest level, researchers study scaling laws for generative model performance prediction, examining how metrics evolve with compute and data. Modality-specific branches investigate scaling in vision, language, and multimodal settings, while domain-specific work targets applications such as recommendation systems, medical report generation, and drug discovery. Parallel efforts focus on pretraining data efficiency and architectural choices, alongside methodological advances in evaluation metrics and benchmarking practices. Together, these branches reflect a maturing understanding of how pretraining investments translate into downstream generative capabilities, with works like Pretraining from Pixels[3] and Autoregressive Modeling Scaling[30] illustrating early foundations, and more recent studies such as Predictability and Surprise[5] and Downstream Task Scaling[4] refining predictive frameworks.

Within the scaling laws literature, a central tension emerges between predictable trends and emergent surprises: some capabilities scale smoothly while others appear discontinuously. Generative Evaluations Scaling[0] sits squarely in this active area, focusing on whether generative task performance can be reliably forecast from pretraining curves. It shares thematic ground with Predictability and Surprise[5], which explores when scaling laws hold versus when they break down, and with Downstream Task Scaling[4], which examines transfer to specific evaluation benchmarks. Compared to Autoregressive Modeling Scaling[30], which established foundational scaling relationships for next-token prediction, Generative Evaluations Scaling[0] emphasizes the generative evaluation regime, where open-ended generation quality must be assessed. This positioning highlights ongoing questions about extrapolation limits, the role of task diversity, and whether pretraining metrics suffice to anticipate complex generative behaviors.

Claimed Contributions

Three pretraining scaling laws for pass-at-k on generative evaluations

The authors introduce three distinct scaling laws that predict pass-at-k performance on generative benchmarks using different covariates: (1) pretraining compute, (2) model parameters and pretraining tokens, and (3) log likelihoods of gold reference solutions. Each law is fit to measured pass-at-k values and backtested by forecasting the performance of the most expensive model from cheaper ones.
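As a concrete illustration of this fit-and-backtest protocol, the sketch below fits a saturating power law in pretraining compute to pass-at-k measurements from cheaper models and extrapolates to the most expensive one. The functional form, the synthetic data, and all variable names are assumptions made for illustration; this is not the submission's code or parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

def compute_law(log10_C, ceiling, a, alpha):
    """Assumed saturating power law in pretraining compute C:
    pass@k = ceiling - a * C**(-alpha), written in log10(C) for numerical conditioning."""
    return ceiling - a * np.power(10.0, -alpha * log10_C)

# Hypothetical models spanning three orders of magnitude of pretraining compute (FLOPs).
log10_compute = np.array([19.0, 19.5, 20.0, 20.5, 21.0, 21.5, 22.0])

# Synthetic pass@k values drawn from an assumed "true" law plus noise (illustration only).
rng = np.random.default_rng(0)
true_params = (0.85, 110.0, 0.115)
pass_at_k = compute_law(log10_compute, *true_params) + rng.normal(0.0, 0.01, log10_compute.size)

# Backtest: fit on every model except the most compute-expensive one, then predict it.
params, _ = curve_fit(
    compute_law, log10_compute[:-1], pass_at_k[:-1],
    p0=(0.9, 100.0, 0.1),                            # initial guess: ceiling, scale, exponent
    bounds=([0.0, 0.0, 0.0], [1.0, np.inf, 1.0]),
)
prediction = compute_law(log10_compute[-1], *params)
print("fitted (ceiling, a, alpha):", np.round(params, 3))
print(f"predicted pass@k at 1e{log10_compute[-1]:.0f} FLOPs: {prediction:.3f} "
      f"(held-out value: {pass_at_k[-1]:.3f})")
```

The same fit-on-cheap, predict-most-expensive protocol applies unchanged when the covariates are parameters and tokens, or gold-reference log likelihoods.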

10 retrieved papers

Identification of k as a control lever for scaling behavior

The authors demonstrate that the number of attempts per problem (k) in generative evaluations acts as a critical hyperparameter that modulates both the scaling law parameters and the predictability of performance, a feature not available in pretraining loss or discriminative evaluation scaling laws.
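To make concrete how k enters the evaluated quantity, the sketch below uses the standard unbiased pass-at-k estimator from the code-generation evaluation literature, pass@k = 1 - C(n-c, k)/C(n, k) for n sampled solutions of which c are correct, computed in a numerically stable product form. Whether the submission uses exactly this estimator is an assumption, and the sweep over k is purely illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator given n sampled solutions with c correct:
    1 - C(n-c, k) / C(n, k), evaluated as a stable running product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical counts for one problem: 200 samples, 14 of them correct.
n_samples, n_correct = 200, 14
for k in (1, 4, 16, 64):
    print(f"pass@{k:>2} = {pass_at_k(n_samples, n_correct, k):.3f}")
```

Averaged over a benchmark, increasing k moves pass-at-k from the mean per-sample success rate (k = 1) toward the fraction of problems solved at least once within the sampling budget, which is the sense in which k acts as a lever on both the fitted scaling-law parameters and the predictability of the resulting curve.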

2 retrieved papers

Theoretical derivation connecting compute law to parameters-and-tokens law

The authors prove that the compute-only scaling law is the compute-optimal envelope of the parameters-and-tokens scaling law, obtained by minimizing the objective under a fixed compute budget. From this, they derive a dimensionless misallocation penalty that quantitatively explains when and why pretraining recipes underperform compute-optimal scaling.
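As a sketch of the kind of reduction involved, assume an additive power-law parameters-and-tokens form and the usual C ≈ 6ND pretraining cost model; both are assumptions standard in the scaling-law literature, and the submission's exact objective (fit to pass-at-k rather than loss) may differ in detail.

```latex
% Assumed parameters-and-tokens law and cost model (illustrative, not the submission's exact form):
f(N, D) = E + A\,N^{-\alpha} + B\,D^{-\beta}, \qquad C \approx 6\,N D .
% Eliminate D = C/(6N) and set \partial f / \partial N = 0 at fixed C:
N^{*}(C) \propto C^{\beta/(\alpha+\beta)}, \qquad D^{*}(C) \propto C^{\alpha/(\alpha+\beta)} .
% Substituting back gives the compute-optimal envelope, a pure power law in C:
f^{*}(C) = E + A'\,C^{-\alpha\beta/(\alpha+\beta)} .
% A recipe with N \neq N^{*}(C) multiplies the reducible term by a penalty that depends only
% on the dimensionless ratio N / N^{*}(C), i.e., on how far the allocation sits from optimal.
```

Under these assumptions, the compute-only law is exactly the envelope traced out by compute-optimal choices of N and D, and the misallocation penalty is dimensionless because it depends on N only through its ratio to the compute-optimal value.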

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Three pretraining scaling laws for pass-at-k on generative evaluations

Contribution

Identification of k as a control lever for scaling behavior

Contribution

Theoretical derivation connecting compute law to parameters-and-tokens law