Pretraining Scaling Laws for Generative Evaluations of Language Models
Overview
Overall Novelty Assessment
The paper proposes three scaling laws for predicting pass-at-k performance on generative evaluations, using compute, parameters-and-tokens, or gold reference likelihoods as covariates. It resides in the 'Generative Task Scaling Laws' leaf, which contains only three papers total, including this one. This is a notably sparse research direction within the broader taxonomy of 31 papers across the field, suggesting that scaling laws specifically targeting generative benchmarks remain under-explored compared to discriminative tasks or pretraining loss prediction.
The taxonomy reveals neighboring leaves focused on discriminative task scaling, pretraining loss prediction, and observational scaling methodologies. The paper's emphasis on generative evaluations (mathematical problem-solving, code generation) distinguishes it from sibling work on discriminative transfer learning and from pretraining-only loss curves. By examining pass-at-k as a metric with its own hyperparameter k, the work diverges from traditional scaling studies that focus on single-point metrics. The scope note for this leaf explicitly excludes discriminative tasks and pretraining loss, positioning this contribution at the intersection of scaling theory and open-ended generation quality.
Among the 22 candidates examined across the three contributions, only one refutable match emerged. The first contribution (three scaling laws for pass-at-k) was checked against 10 candidates with zero refutations, suggesting limited direct prior work on this specific formulation. The second contribution (k as a control lever) was checked against 2 candidates, also with no refutations. The third contribution (the theoretical derivation connecting compute to parameters and tokens) was checked against 10 candidates and yielded 1 refutable match, indicating some overlap with existing scaling-law theory. Given the limited search scope of 22 papers, these statistics suggest moderate novelty for the first two contributions and more substantial prior work for the theoretical derivation.
Based on top-22 semantic matches, the work appears to occupy a relatively open niche within generative task scaling. The sparse leaf structure and low refutation counts suggest that predicting pass-at-k from pretraining curves is less saturated than foundational scaling law research. However, the analysis does not cover exhaustive citation networks or domain-specific generative benchmarks outside the examined candidates, leaving open the possibility of additional relevant prior work in specialized venues or recent preprints.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce three distinct scaling laws that predict pass-at-k performance on generative benchmarks using different covariates: (1) pretraining compute, (2) model parameters and pretraining tokens, and (3) log likelihoods of gold reference solutions. Each law is rigorously fit and backtested to forecast the performance of expensive models from cheaper ones.
The authors demonstrate that the number of attempts per problem (k) in generative evaluations acts as a critical hyperparameter that modulates both the scaling law parameters and the predictability of performance, a feature not available in pretraining loss or discriminative evaluation scaling laws.
The authors prove that the compute-only scaling law is the compute-optimal envelope of the parameters-and-tokens scaling law, obtained by optimizing the allocation of parameters and tokens at a fixed compute budget. From this, they derive a dimensionless misallocation penalty that quantitatively explains when and why pretraining recipes underperform compute-optimal scaling.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] Predictability and Surprise in Large Generative Models
[30] Scaling Laws for Autoregressive Generative Modeling
Contribution Analysis
Detailed comparisons for each claimed contribution
Three pretraining scaling laws for pass-at-k on generative evaluations
The authors introduce three distinct scaling laws that predict pass-at-k performance on generative benchmarks using different covariates: (1) pretraining compute, (2) model parameters and pretraining tokens, and (3) log likelihoods of gold reference solutions. Each law is rigorously fit and backtested to forecast the performance of expensive models from cheaper ones.
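To make the claim concrete, the sketch below shows plausible functional forms for the three laws. The saturation constants, exponents, and link functions here are illustrative assumptions for exposition, not the authors' fitted parameterizations.

```latex
% Illustrative forms only; the paper's fitted parameterizations may differ.
% (1) Compute-only law: pass-at-k as a saturating power law in pretraining FLOPs C.
\[ \mathrm{pass@}k(C) \;\approx\; A_k - B_k\, C^{-\alpha_k} \]
% (2) Parameters-and-tokens law: additive power-law terms in model size N and tokens D.
\[ \mathrm{pass@}k(N, D) \;\approx\; A_k - \Bigl(\frac{E_N}{N^{\beta_k}} + \frac{E_D}{D^{\gamma_k}}\Bigr) \]
% (3) Gold-reference-likelihood law: a sigmoidal link from the mean log likelihood
%     \ell of gold reference solutions to pass-at-k.
\[ \mathrm{pass@}k(\ell) \;\approx\; \frac{A_k}{1 + e^{-s_k(\ell - \ell_0)}} \]
```

Under forms like these, backtesting amounts to fitting the free parameters on cheaper models and checking the extrapolated prediction against held-out, more expensive models.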
[34] Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
[35] τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
[36] SimKO: Simple Pass@K Policy Optimization
[37] τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
[38] Top Pass: Improve Code Generation by Pass@k-Maximized Code Ranking
[39] Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
[40] Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
[41] SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation
[42] A Simple Model of Inference Scaling Laws
[43] Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
Identification of k as a control lever for scaling behavior
The authors demonstrate that the number of attempts per problem (k) in generative evaluations acts as a critical hyperparameter that modulates both the scaling law parameters and the predictability of performance, a feature not available in pretraining loss or discriminative evaluation scaling laws.
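For reference, the standard unbiased pass@k estimator (popularized by the Codex evaluation) makes the role of k explicit. The snippet below is a minimal sketch with illustrative variable names, not the authors' evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for one problem, given n sampled
    attempts of which c are correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k subset
        # of the n attempts contains at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Sweeping k at a fixed sampling budget shows how k reshapes the metric that
# the scaling laws target: larger k lifts and flattens the resulting curve.
n_samples, n_correct = 64, 3
for k in (1, 4, 16, 64):
    print(k, round(pass_at_k(n_samples, n_correct, k), 4))
```

Because k enters the metric directly, each choice of k yields a different target curve to fit, which is the sense in which k acts as a control lever on the scaling law's parameters and predictability.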
Theoretical derivation connecting compute law to parameters-and-tokens law
The authors prove that the compute-only scaling law is the compute-optimal envelope of the parameters-and-tokens scaling law, obtained by optimizing the allocation of parameters and tokens at a fixed compute budget. From this, they derive a dimensionless misallocation penalty that quantitatively explains when and why pretraining recipes underperform compute-optimal scaling.
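A sketch of the kind of envelope argument being claimed, under the common C ≈ 6ND compute accounting and an assumed additive power-law error term (the paper's own functional form and derivation may differ in detail):

```latex
% Envelope sketch; assumes C \approx 6ND and an additive power-law error term.
\[ \min_{N, D}\; \frac{E_N}{N^{\beta}} + \frac{E_D}{D^{\gamma}}
   \quad \text{subject to} \quad C = 6ND \]
% Substituting D = C/(6N) and setting the N-derivative to zero gives
\[ N^*(C) \propto C^{\gamma/(\beta+\gamma)}, \qquad D^*(C) \propto C^{\beta/(\beta+\gamma)}, \]
% so the optimal error term scales as a pure power law in compute,
\[ \frac{E_N}{(N^*)^{\beta}} + \frac{E_D}{(D^*)^{\gamma}} \;\propto\; C^{-\beta\gamma/(\beta+\gamma)}, \]
% recovering a compute-only law as the envelope of the parameters-and-tokens law.
```

In a formulation like this, a recipe trained at 6ND = C but with N ≠ N*(C) sits above the envelope, and the ratio of its error term to the envelope value at the same compute is a dimensionless misallocation factor, which is the flavor of penalty the authors appear to derive.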