Can Language Models Discover Scaling Laws?

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: scaling law; agent; LLM
Abstract:

Discovering scaling laws for predicting model performance at scale is a fundamental and open-ended challenge, one that still relies largely on slow, case-specific human experimentation. To investigate the potential for LLMs to automate this process, we collect over 5,000 experiments from the existing literature and curate seven diverse scaling law discovery tasks. While existing agents struggle to produce accurate law formulas, this paper introduces SLDAgent, an evolution-based agent that co-optimizes the scaling law model and its parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrate that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts across all tasks. Through comprehensive analysis, we elucidate why these discovered laws are superior and verify their practical utility in both pretraining and finetuning applications. This work establishes a new paradigm for agentic scientific discovery, showing that AI systems can understand their own scaling behavior and contribute novel, practical knowledge back to the research community.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SLDAgent, an evolution-based system that autonomously discovers scaling law formulas from experimental data, and SLDBench, a benchmark comprising over 5,000 experiments across seven tasks. This work occupies the 'Automated Scaling Law Discovery' leaf in the taxonomy, which currently contains no sibling papers—making it the sole representative of this research direction. While the broader taxonomy encompasses 50 papers across 33 leaf nodes, this particular branch remains sparse, suggesting that automated discovery of scaling laws is an emerging rather than crowded area.

The taxonomy reveals substantial activity in adjacent branches: empirical characterization methods (Fundamental Compute-Loss Scaling, Temporal Dynamics), predictive modeling approaches (Observational Scaling Law Inference, Downstream Performance Prediction), and hyperparameter optimization (Hyperparameter and Training Configuration Scaling). The original paper diverges from these by proposing meta-level automation—using language models to discover laws rather than manually fitting empirical data or observationally inferring relationships. This positions the work at the intersection of predictive modeling and training method optimization, but with a fundamentally different mechanism: agentic exploration rather than human-guided experimentation or statistical extrapolation.

Among 26 candidates examined, the contribution-level analysis reveals mixed novelty signals. The benchmark contribution (SLDBench) examined 10 candidates with no clear refutations, suggesting this curation effort addresses a gap in standardized evaluation. The agent contribution (SLDAgent) examined 6 candidates and found 1 refutable match, indicating some overlap with prior automated discovery or optimization methods within this limited search scope. The superhuman performance claim examined 10 candidates without refutation, though this reflects the search scale rather than exhaustive validation. The statistics suggest moderate prior work density for the agent mechanism, but sparser coverage for benchmark construction and performance claims.

Given the limited search scope of 26 semantically similar papers, this analysis captures nearby work but cannot claim exhaustive coverage of all relevant optimization, meta-learning, or symbolic regression methods. The taxonomy structure and contribution-level statistics together suggest the work occupies a genuinely sparse research direction (automated scaling law discovery), though individual technical components (evolution-based search, formula optimization) may connect to broader literatures in automated machine learning and symbolic discovery not fully represented in this domain-specific search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Paper: 1

Research Landscape Overview

Core task: automated discovery of scaling laws for language model performance.

The field has matured into a rich taxonomy spanning empirical characterization of how loss and performance scale with compute, data, and model size; architecture-specific investigations into transformers, mixture-of-experts, and quantized models; data-centric studies examining quality, diversity, and synthetic data effects; and capability-specific analyses for reasoning, memorization, and multilingual performance. Branches also address training methods and hyperparameter tuning, predictive modeling techniques that enable observational inference without exhaustive training, model composition and merging dynamics, system-level considerations for distributed training, theoretical foundations rooted in information theory, large-scale empirical studies from industry labs, robustness and safety implications, and applications extending scaling insights to new domains.

Representative works illustrate this breadth: Neural Scaling Laws[14] and Observational Scaling Laws[5] anchor empirical and predictive methods, while Pythia[29] and DeepSeek LLM[7] exemplify large-scale empirical studies, and Inference Scaling Laws[4] and Test-Time Compute Scaling[39] explore compute allocation beyond pretraining. Recent activity highlights tensions between observational efficiency and experimental rigor, with Observational Scaling Laws[5] enabling low-cost prediction while works like Algorithmic Progress LMs[1] and Temporal Scaling Law[2] track how algorithmic improvements shift scaling curves over time.

The original paper, LMs Discover Scaling[0], sits squarely within the Automated Scaling Law Discovery branch, proposing that language models themselves can identify and formulate scaling relationships, a meta-level approach contrasting with manual empirical fitting or observational extrapolation.
This automation theme connects to AutoScale[43] and Optimal Hyperparameter Scaling[46], which similarly seek to reduce human effort in characterizing scaling behavior. By leveraging models' own reasoning capabilities, LMs Discover Scaling[0] offers a novel complement to traditional methods, potentially accelerating the discovery process as models grow more capable and the space of architectural and training choices expands.

Claimed Contributions

SLDBench: A comprehensive scaling law discovery benchmark

The authors introduce SLDBench, a benchmark containing seven diverse scaling law discovery tasks derived from over 5,000 experiments in existing literature. Each task requires identifying a symbolic expression that accurately extrapolates to unseen test data, providing a rigorous testbed for evaluating agentic scientific discovery systems.

10 retrieved papers
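The evaluation protocol this benchmark implies, fitting a candidate law on small-scale runs and scoring it by extrapolation error on held-out larger-scale runs, can be sketched as follows. This is a minimal, stdlib-only illustration with synthetic data and a hypothetical pure power-law candidate; it is not SLDBench's actual data, task set, or scoring code.

```python
import math
from statistics import mean

# Synthetic "experiments": (model size N, loss L) drawn from an assumed
# ground truth L = 2.0 + 400 / N**0.3 (irreducible loss plus power-law term).
runs = [(n, 2.0 + 400.0 / n**0.3)
        for n in (1e6, 3e6, 1e7, 3e7, 1e8, 1e9, 1e10)]
train, test = runs[:5], runs[5:]          # held-out points are the largest N

def fit_power_law(points):
    """Least-squares fit of log L = log A - alpha * log N (log-log regression)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(loss) for _, loss in points]
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope   # A, alpha

A, alpha = fit_power_law(train)

def pred(n):
    return A * n ** (-alpha)

# Score the candidate by mean relative error on the held-out large-N runs;
# the error stays well above zero because the form lacks the irreducible term.
err = mean(abs(pred(n) - loss) / loss for n, loss in test)
print(f"fitted A={A:.3g}, alpha={alpha:.3g}, extrapolation error={err:.1%}")
```

Because the candidate form omits the irreducible-loss term present in the synthetic ground truth, it fits the training points closely yet extrapolates poorly, which is exactly the kind of form misspecification a held-out large-scale test split is designed to expose.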
SLDAgent: An evolution-based agent for scaling law discovery

SLDAgent: An evolution-based agent for scaling law discovery

The authors propose SLDAgent, a novel evolution-based agent that co-optimizes both the scaling law expression and its parameter fitting routine. This evolutionary approach enables autonomous exploration of complex variable relationships and achieves state-of-the-art performance on scaling law discovery tasks.

6 retrieved papers (Can Refute)

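The co-optimization idea described above, an outer evolutionary search over law forms combined with an inner parameter-fitting routine, can be sketched as follows. Everything here is illustrative: the two candidate forms, the mutation scheme, and the synthetic data are assumptions for the sketch, not SLDAgent's actual implementation.

```python
import math
import random
from statistics import mean

random.seed(0)
# Synthetic runs from an assumed ground truth L = 2.0 + 400 / N**0.3.
data = [(n, 2.0 + 400.0 / n**0.3) for n in (1e6, 3e6, 1e7, 3e7, 1e8)]

def fit(form, p):
    """Inner loop: for a fixed exponent p, solve y ~ E + A*form(N, p)
    exactly via the 2x2 normal equations (E, A enter linearly)."""
    phi = [form(n, p) for n, _ in data]
    y = [loss for _, loss in data]
    k, sp, spp = len(phi), sum(phi), sum(v * v for v in phi)
    sy, spy = sum(y), sum(v * t for v, t in zip(phi, y))
    A = (k * spy - sp * sy) / (k * spp - sp * sp)
    E = (sy - A * sp) / k
    mse = mean((E + A * v - t) ** 2 for v, t in zip(phi, y))
    return E, A, mse

# Outer loop's search space: two hypothetical law families.
forms = {
    "power":  lambda n, p: n ** (-p),
    "logpow": lambda n, p: math.log(n) ** (-p),
}

best = None
for name, form in forms.items():
    p = 0.5
    for _ in range(200):                  # mutate exponent, keep improvements
        cand = max(1e-3, p + random.gauss(0, 0.05))
        if fit(form, cand)[2] <= fit(form, p)[2]:
            p = cand
    E, A, mse = fit(form, p)
    if best is None or mse < best[-1]:
        best = (name, p, E, A, mse)

print(best)  # the "power" family should win, with p near the true 0.3
```

Fixing the nonlinear exponent makes the inner fit ordinary least squares over the two linear parameters, a common trick for keeping parameter fitting cheap and exact while the outer evolutionary loop explores the space of law forms.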
Demonstration of superhuman scaling law discovery

The authors demonstrate for the first time that an AI agent can autonomously discover scaling laws that consistently outperform human-derived counterparts in extrapolation accuracy across all benchmark tasks. They validate the practical utility of these discovered laws in pretraining and fine-tuning applications.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.
