Relative Scaling Laws for LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Scaling Laws, Linguistic Variation, Domain Shift, AI Risk
Abstract:

Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from 10^18 to 10^20 FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work, enabling practitioners to measure relative scaling laws alongside traditional ones and to better prioritize robustness challenges in light of the bitter lesson.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than measuring aggregate error alone. It sits in the 'Relative and Heterogeneous Scaling Dynamics' leaf, which contains only three papers total, indicating a relatively sparse research direction. The sibling papers examine related questions about heterogeneous scaling behavior, but the taxonomy shows this subpopulation-focused perspective remains less explored than aggregate scaling law studies, which form a separate, more established leaf.

The taxonomy places this work within 'Scaling Laws and Distribution-Dependent Performance,' adjacent to branches on test-time compute scaling and aggregate scaling studies. Neighboring leaves address distribution shift characterization and test-time adaptation methods, reflecting the field's broader concern with robustness. The scope note for the paper's leaf explicitly excludes aggregate-only trends, positioning relative scaling laws as a complementary lens that examines whether scale acts as a universal equalizer or produces divergent trajectories across subpopulations.

Among 24 candidates examined across three contributions, none were flagged as clearly refuting the work. The relative scaling laws framework examined 10 candidates with zero refutations; the open-source IsoFLOP suite examined 4 with zero refutations; and the empirical case studies examined 10 with zero refutations. This suggests that within the limited search scope, the specific combination of tracking performance gaps across diverse test distributions under matched-compute budgets appears relatively unexplored, though the analysis does not claim exhaustive coverage of all prior scaling law research.

Based on the limited literature search, the work appears to occupy a distinct position within scaling law research by systematically measuring relative rather than absolute performance trends. The sparse population of its taxonomy leaf and absence of refuting candidates among those examined suggest novelty, though the search scope of 24 papers leaves open the possibility of relevant work outside the top semantic matches. The release of 255 model checkpoints may enable future comparative studies that were not captured in this analysis.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: tracking performance gaps between test distributions with scale. The field examines how model performance evolves across different test distributions as models and datasets grow, organizing research into several major branches. Scaling Laws and Distribution-Dependent Performance investigates how accuracy improvements vary heterogeneously across in-distribution versus out-of-distribution settings, with works like Relative Scaling Laws[0] and Beyond Neural Scaling[2] exploring whether all test scenarios benefit equally from increased compute. Distribution Shift Characterization and Benchmarking focuses on measuring and taxonomizing the types of shifts that occur in practice, while Test-Time Adaptation Methods and Training-Time Robustness strategies offer complementary approaches to closing performance gaps. Additional branches address theoretical foundations of generalization, cross-domain transfer, and rigorous evaluation methodology, reflecting the community's recognition that a single accuracy number often masks important disparities.

Particularly active lines of work reveal tensions between scaling optimism and robustness challenges. Some studies suggest that larger models naturally improve worst-case performance, yet others document persistent or even widening gaps on certain distribution shifts, raising questions about whether scale alone suffices or whether targeted interventions remain necessary. Relative Scaling Laws[0] sits within the branch examining heterogeneous scaling dynamics, emphasizing how different test distributions respond differently to model scale, a perspective closely aligned with Beyond Neural Scaling[2], which questions uniform scaling benefits, and contrasting with works like Thinking Optimal Scaling[3] that explore allocation strategies. Nearby efforts such as Efficient Test Time Adaptation[5] and Test Time Robust Personalization[6] offer adaptive mechanisms to address gaps that scaling does not fully resolve, highlighting an ongoing dialogue about whether robustness emerges automatically or requires explicit design choices at training or test time.

Claimed Contributions

Relative scaling laws framework

The authors formalize a framework for measuring how performance disparities between different test distributions change as models scale, separating initial gaps from differences in improvement rates. This is formulated as a power law that indicates whether gaps narrow, persist, or widen with increased compute.
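The report does not reproduce the paper's exact formulation. As a minimal sketch, assuming each distribution's loss follows a power law L(C) = a * C^(-b) in compute C, the gap between two distributions is itself a power law whose prefactor captures the initial gap and whose exponent is the difference in improvement rates (all values below are synthetic illustrations, not results from the paper):

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of L(C) = a * C**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

def relative_scaling(compute, loss_a, loss_b):
    """Fit both distributions, then express the gap as a power law:
    L_A(C) / L_B(C) = (a_A / a_B) * C**(-(b_A - b_B)).
    A positive returned exponent means the gap narrows with compute."""
    a1, b1 = fit_power_law(compute, loss_a)
    a2, b2 = fit_power_law(compute, loss_b)
    return a1 / a2, b1 - b2

# Hypothetical IsoFLOP budgets and synthetic per-distribution losses
budgets = [1e18, 1e19, 1e20]
loss_a = [3.0 * c ** -0.05 for c in budgets]  # slower-improving distribution
loss_b = [2.0 * c ** -0.07 for c in budgets]  # faster-improving distribution

ratio0, delta_b = relative_scaling(budgets, loss_a, loss_b)
# delta_b < 0 here: distribution A improves more slowly, so the gap widens
```

The sign of the exponent difference is what classifies a pair of distributions as converging, persistent, or diverging under scale.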

10 retrieved papers
Open-source IsoFLOP scaling suite of 255 models

The authors train and publicly release 255 decoder-only Transformer models under matched-compute budgets spanning three orders of magnitude across three distinct pretraining datasets. This resource enables reproducible study of both traditional and relative scaling laws.
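The report does not spell out how the IsoFLOP grids were constructed. A hypothetical sketch, assuming the common C ~ 6*N*D compute approximation (an assumption not stated in the source), shows how a fixed budget trades model size against token count:

```python
# IsoFLOP sketch: at a fixed compute budget C (FLOPs), each model size N
# (parameters) determines a token budget D via the approximation C ~ 6*N*D.
# Every (N, D) pair returned below costs the same total compute.
def isoflop_pairs(budget_flops, param_counts):
    return [(n, budget_flops / (6 * n)) for n in param_counts]

# Hypothetical grid for a 1e19-FLOP budget
pairs = isoflop_pairs(1e19, [1e8, 3e8, 1e9])
```

Sweeping such grids across several budgets is what yields the matched-compute curves that scaling-law fits are computed from.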

4 retrieved papers
Empirical case studies demonstrating diverse relative scaling trajectories

The authors demonstrate the application of relative scaling laws across three distinct domains, revealing diverse trajectories including convergence of academic domains, mixed effects for regional English dialects, and divergence between capability-related and adversarial AI risks. These studies show that scale has non-uniform impacts on distributional robustness.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Relative scaling laws framework
Contribution: Open-source IsoFLOP scaling suite of 255 models
Contribution: Empirical case studies demonstrating diverse relative scaling trajectories
