Relative Scaling Laws for LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Scaling Laws, Linguistic Variation, Domain Shift, AI Risk
Abstract:

Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from 10^18 to 10^20 FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work, enabling practitioners to measure relative scaling laws alongside traditional ones and to better prioritize robustness challenges in light of the bitter lesson.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than measuring aggregate error alone. It sits in the 'Relative and Heterogeneous Scaling Dynamics' leaf, which contains only three papers total, indicating a relatively sparse research direction. The sibling papers examine related questions about heterogeneous scaling behavior, but the taxonomy shows this subpopulation-focused perspective remains less explored than aggregate scaling law studies, which form a separate, more established leaf.

The taxonomy places this work within 'Scaling Laws and Distribution-Dependent Performance,' adjacent to branches on test-time compute scaling and aggregate scaling studies. Neighboring leaves address distribution shift characterization and test-time adaptation methods, reflecting the field's broader concern with robustness. The scope note for the paper's leaf explicitly excludes aggregate-only trends, positioning relative scaling laws as a complementary lens that examines whether scale acts as a universal equalizer or produces divergent trajectories across subpopulations.

Among 24 candidates examined across three contributions, none were flagged as clearly refuting the work. The relative scaling laws framework examined 10 candidates with zero refutations; the open-source IsoFLOP suite examined 4 with zero refutations; and the empirical case studies examined 10 with zero refutations. This suggests that within the limited search scope, the specific combination of tracking performance gaps across diverse test distributions under matched-compute budgets appears relatively unexplored, though the analysis does not claim exhaustive coverage of all prior scaling law research.

Based on the limited literature search, the work appears to occupy a distinct position within scaling law research by systematically measuring relative rather than absolute performance trends. The sparse population of its taxonomy leaf and absence of refuting candidates among those examined suggest novelty, though the search scope of 24 papers leaves open the possibility of relevant work outside the top semantic matches. The release of 255 model checkpoints may enable future comparative studies that were not captured in this analysis.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: tracking performance gaps between test distributions with scale. The field examines how model performance evolves across different test distributions as models and datasets grow, organizing research into several major branches. Scaling Laws and Distribution-Dependent Performance investigates how accuracy improvements vary heterogeneously across in-distribution versus out-of-distribution settings, with works like Relative Scaling Laws[0] and Beyond Neural Scaling[2] exploring whether all test scenarios benefit equally from increased compute. Distribution Shift Characterization and Benchmarking focuses on measuring and taxonomizing the types of shifts that occur in practice, while Test-Time Adaptation Methods and Training-Time Robustness strategies offer complementary approaches to closing performance gaps. Additional branches address theoretical foundations of generalization, cross-domain transfer, and rigorous evaluation methodology, reflecting the community's recognition that a single accuracy number often masks important disparities.

Particularly active lines of work reveal tensions between scaling optimism and robustness challenges. Some studies suggest that larger models naturally improve worst-case performance, yet others document persistent or even widening gaps on certain distribution shifts, raising questions about whether scale alone suffices or whether targeted interventions remain necessary. Relative Scaling Laws[0] sits within the branch examining heterogeneous scaling dynamics, emphasizing how different test distributions respond differently to model scale, a perspective closely aligned with Beyond Neural Scaling[2], which questions uniform scaling benefits, and contrasting with works like Thinking Optimal Scaling[3] that explore allocation strategies. Nearby efforts such as Efficient Test Time Adaptation[5] and Test Time Robust Personalization[6] offer adaptive mechanisms to address gaps that scaling does not fully resolve, highlighting an ongoing dialogue about whether robustness emerges automatically or requires explicit design choices at training or test time.

Claimed Contributions

Relative scaling laws framework

The authors formalize a framework for measuring how performance disparities between different test distributions change as models scale, separating initial gaps from differences in improvement rates. This is formulated as a power law that indicates whether gaps narrow, persist, or widen with increased compute.
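The report does not reproduce the paper's exact formulation. As a minimal sketch, assuming each distribution's loss follows a power law L(C) = a * C^(-b) in compute C, the gap between two distributions is itself a power law whose prefactor captures the initial gap and whose exponent is the difference in improvement rates (all values below are synthetic illustrations, not results from the paper):

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of L(C) = a * C**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

def relative_scaling(compute, loss_a, loss_b):
    """Fit both distributions, then express the gap as a power law:
    L_A(C) / L_B(C) = (a_A / a_B) * C**(-(b_A - b_B)).
    A positive returned exponent means the gap narrows with compute."""
    a1, b1 = fit_power_law(compute, loss_a)
    a2, b2 = fit_power_law(compute, loss_b)
    return a1 / a2, b1 - b2

# Hypothetical IsoFLOP budgets and synthetic per-distribution losses
budgets = [1e18, 1e19, 1e20]
loss_a = [3.0 * c ** -0.05 for c in budgets]  # slower-improving distribution
loss_b = [2.0 * c ** -0.07 for c in budgets]  # faster-improving distribution

ratio0, delta_b = relative_scaling(budgets, loss_a, loss_b)
# delta_b < 0 here: distribution A improves more slowly, so the gap widens
```

The sign of the exponent difference is what classifies a pair of distributions as converging, persistent, or diverging under scale.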

10 retrieved papers
Open-source IsoFLOP scaling suite of 255 models

The authors train and publicly release 255 decoder-only Transformer models under matched-compute budgets spanning three orders of magnitude across three distinct pretraining datasets. This resource enables reproducible study of both traditional and relative scaling laws.
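The report does not spell out how the IsoFLOP grids were constructed. A hypothetical sketch, assuming the common C ~ 6*N*D compute approximation (an assumption not stated in the source), shows how a fixed budget trades model size against token count:

```python
# IsoFLOP sketch: at a fixed compute budget C (FLOPs), each model size N
# (parameters) determines a token budget D via the approximation C ~ 6*N*D.
# Every (N, D) pair returned below costs the same total compute.
def isoflop_pairs(budget_flops, param_counts):
    return [(n, budget_flops / (6 * n)) for n in param_counts]

# Hypothetical grid for a 1e19-FLOP budget
pairs = isoflop_pairs(1e19, [1e8, 3e8, 1e9])
```

Sweeping such grids across several budgets is what yields the matched-compute curves that scaling-law fits are computed from.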

4 retrieved papers
Empirical case studies demonstrating diverse relative scaling trajectories

The authors demonstrate the application of relative scaling laws across three distinct domains, revealing diverse trajectories including convergence of academic domains, mixed effects for regional English dialects, and divergence between capability-related and adversarial AI risks. These studies show that scale has non-uniform impacts on distributional robustness.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Relative scaling laws framework
Contribution: Open-source IsoFLOP scaling suite of 255 models
Contribution: Empirical case studies demonstrating diverse relative scaling trajectories
