Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Scaling laws; Neural networks; LASSO and matrix compressed sensing; Random matrix theory; Approximate message passing; High-dimensional statistics
Abstract:

Neural scaling laws underlie many of the recent advances in deep learning, yet their theoretical understanding remains largely confined to linear models. In this work, we present a systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime. Leveraging connections with matrix compressed sensing and LASSO, we derive a detailed phase diagram for the scaling exponents of the excess risk as a function of sample complexity and weight decay. This analysis uncovers crossovers between distinct scaling regimes and plateau behaviors, mirroring phenomena widely reported in the empirical neural scaling literature. Furthermore, we establish a precise link between these regimes and the spectral properties of the trained network weights, which we characterize in detail. As a consequence, we provide a theoretical validation of recent empirical observations connecting the emergence of power-law tails in the weight spectrum with network generalization performance, yielding an interpretation from first principles.
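To make the LASSO connection concrete, the following is a minimal background sketch of the standard identity for diagonal linear networks (the paper's exact model, scaling, and regularization may differ, so this is context rather than a restatement of its results). Writing the effective weights of a diagonal network f(x) = sum_i u_i v_i x_i as beta_i = u_i v_i, l2 weight decay on (u, v) collapses to an l1 penalty on beta:

```latex
% Minimizing the weight-decay penalty over all factorizations of a fixed \beta_i,
% by the AM--GM inequality, with equality at |u_i| = |v_i| = \sqrt{|\beta_i|}:
\min_{u_i v_i = \beta_i} \ \frac{\lambda}{2}\left(u_i^{2} + v_i^{2}\right) \;=\; \lambda\,\lvert\beta_i\rvert .
```

Summing over coordinates, training the factorized model with weight decay lambda behaves like LASSO with penalty lambda * ||beta||_1; the analogous reduction for quadratic networks typically yields a nuclear-norm penalty on the effective weight matrix, which is the bridge to matrix compressed sensing.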

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper derives scaling laws and phase diagrams for quadratic and diagonal neural networks in the feature learning regime, connecting excess risk to sample complexity and weight decay through matrix compressed sensing and LASSO theory. It resides in the 'Scaling Laws and Phase Transitions' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. The sibling papers in this leaf focus on related but distinct aspects: one examines hidden-structure exploitation and another addresses scaling in different network configurations, suggesting that this is an emerging area with limited prior theoretical characterization.

The taxonomy reveals that this work sits at the intersection of several active research threads. Its parent branch 'Theoretical Foundations of Feature Learning and Scaling' contains neighboring leaves on infinite-width limits, finite-width dynamics, and kernel-to-feature transitions—all addressing complementary aspects of feature learning theory. The paper's focus on finite-width shallow networks distinguishes it from infinite-width mean-field approaches while connecting to the broader question of when and how networks transition from kernel to feature learning regimes. The taxonomy's 'Empirical Scaling Behavior' branch contains parallel work on transformers and deep networks, highlighting that rigorous theoretical scaling analysis for shallow feature-learning networks occupies a distinct niche.

Among thirty candidates examined, the contribution-level analysis reveals mixed novelty signals. The first contribution (systematic scaling analysis for quadratic/diagonal networks) found zero refutable candidates among ten examined, suggesting this specific architectural focus may be novel within the limited search scope. The second contribution (spectral characterization across phases) similarly showed no clear refutations in ten candidates. However, the third contribution (theoretical validation of spectra-generalization connection) identified one refutable candidate among ten examined, indicating some prior theoretical work exists on linking weight spectra to generalization, though the specific first-principles derivation in this feature-learning context may still offer new insights.

The analysis suggests moderate novelty given the limited search scope of thirty semantically similar papers. The architectural focus on quadratic and diagonal networks in feature learning appears relatively unexplored, while the spectra-generalization connection has some theoretical precedent. The sparse population of the 'Scaling Laws and Phase Transitions' leaf (three papers) and the absence of clear refutations for most contributions indicate this work likely advances the theoretical understanding of shallow network scaling, though a more exhaustive literature search would be needed to definitively assess its novelty relative to the broader compressed sensing and statistical learning theory communities.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: understanding scaling laws for shallow neural networks in the feature learning regime. The field has organized itself around several complementary perspectives. Theoretical Foundations of Feature Learning and Scaling investigates phase transitions, infinite-width limits, and the mathematical underpinnings of how networks learn representations rather than merely interpolating via kernel methods; works such as Feature Learning Infinite Width[1] and Scaling Laws Hidden Structure[4] exemplify this branch. Empirical Scaling Behavior and Architecture Design focuses on practical scaling trends across different architectures, including transformers and alternative designs such as Scaling Laws Transformers[3]. Feature Learning Mechanisms and Inductive Biases examines what kinds of features networks prefer to learn, including simplicity biases and the role of initialization, while Learning Algorithms and Training Procedures studies how optimization dynamics, such as large learning rates or greedy layerwise training, affect feature emergence. Application-Specific Architectures and Domain Adaptations, together with Network Analysis and Interpretability, round out the taxonomy by addressing domain-specific constraints and methods for understanding learned representations.

A particularly active line of work explores the transition from lazy (kernel) to rich (feature-learning) regimes as network width and training scale vary. Emergence Scaling SGD[28] and Feature Learning Scaling Laws[30] investigate how feature learning emerges with scale, while Simplicity Bias Shallow Networks[7] and Shallow Networks Curse Dimensionality[11] highlight trade-offs between expressiveness and sample complexity in shallow architectures. Scaling Laws Feature Learning[0] sits squarely within this theoretical cluster, focusing on rigorous characterizations of how shallow networks' scaling behavior changes when they actively learn features. Its emphasis on phase transitions and finite-width effects contrasts with the infinite-width perspective of Feature Learning Infinite Width[1] and complements the empirical focus of Feature Learning Scaling Laws[30], offering a bridge between asymptotic theory and practical scaling phenomena in the feature learning regime.

Claimed Contributions

Systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime

The authors provide a comprehensive theoretical characterization of how excess risk scales with sample size and regularization strength for two shallow network architectures (diagonal and quadratic networks) that exhibit genuine feature learning. They derive a complete phase diagram showing distinct scaling regimes and crossovers between them.

Candidate papers retrieved for comparison: 10
Precise characterization of spectral properties of trained network weights across all phases

The authors derive exact formulas for the eigenvalue distribution of the learned weights across all training phases. They show that the spectrum of the learned weights is a noisy, soft-thresholded version of the target spectrum, consisting of spikes, a bulk component, and zero eigenvalues depending on the training regime.

Candidate papers retrieved for comparison: 10
First-principles theoretical validation of spectra-generalization connection

The authors establish a universal error decomposition that directly connects spectral features (bulk, spikes, outliers) to distinct error components (overfitting, underfitting, approximation error). This provides a mathematical foundation for empirical observations that heavy-tailed weight spectra correlate with better generalization.

Candidate papers retrieved for comparison: 10
Status: Can Refute (one refutable candidate identified among the ten examined)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

Systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime

The authors provide a comprehensive theoretical characterization of how excess risk scales with sample size and regularization strength for two shallow network architectures (diagonal and quadratic networks) that exhibit genuine feature learning. They derive a complete phase diagram showing distinct scaling regimes and crossovers between them.
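As a purely illustrative companion to this claim, the sketch below shows what measuring a scaling exponent looks like in practice: generate an excess-risk curve that decays as a power law in the sample size before flattening into a plateau, then read the exponent off a log-log fit. The exponent 0.5, the plateau level, and the synthetic curve are assumptions chosen for illustration, not values derived in the paper.

```python
import numpy as np

# Hypothetical excess-risk curve: a power law in the sample size n plus a plateau floor,
# mimicking the crossover/plateau behaviour discussed above.
# The exponent (0.5) and the floor (1e-3) are illustrative assumptions, not the paper's values.
n = np.logspace(2, 5, 20)
excess_risk = 2.0 * n ** (-0.5) + 1e-3

# Estimate the scaling exponent by least squares on the log-log curve,
# restricted to the regime where the power law dominates the plateau.
mask = excess_risk > 1e-2
slope, _ = np.polyfit(np.log(n[mask]), np.log(excess_risk[mask]), 1)
print(f"fitted scaling exponent: {-slope:.2f}")  # recovers roughly the assumed 0.5
```

A phase diagram in the paper's sense would report how such an exponent changes as the sample complexity and the weight-decay strength are varied jointly.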

Contribution 2

Precise characterization of spectral properties of trained network weights across all phases

The authors derive exact formulas for the eigenvalue distribution of the learned weights across all training phases. They show that the spectrum of the learned weights is a noisy, soft-thresholded version of the target spectrum, consisting of spikes, a bulk component, and zero eigenvalues depending on the training regime.
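The "noisy, soft-thresholded spectrum" picture can be illustrated with the standard soft-thresholding operator from LASSO theory. In the minimal sketch below, the target spectrum, noise level, and threshold are illustrative assumptions; the paper derives the actual effective noise scale and threshold from its theory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target spectrum: a few spikes plus exact zeros.
target_spectrum = np.array([5.0, 3.0, 1.0, 0.0, 0.0, 0.0])
noise = 0.3 * rng.standard_normal(target_spectrum.shape)  # assumed effective observation noise
threshold = 0.5                                            # plays the role of the weight-decay level

def soft_threshold(x, t):
    """Standard soft-thresholding operator used in LASSO-type estimators."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

learned_spectrum = soft_threshold(target_spectrum + noise, threshold)
print(learned_spectrum)
# Large target eigenvalues survive as shrunk spikes, small or zero ones are set to zero,
# and any noise that clears the threshold shows up as a spurious bulk.
```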

Contribution 3

First-principles theoretical validation of spectra-generalization connection

The authors establish a universal error decomposition that directly connects spectral features (bulk, spikes, outliers) to distinct error components (overfitting, underfitting, approximation error). This provides a mathematical foundation for empirical observations that heavy-tailed weight spectra correlate with better generalization.
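Schematically, such a decomposition splits the excess risk by spectral component. The grouping below is only an illustration of what a bulk/spike error decomposition looks like; the paper's exact statement, normalization, and assignment of spectral features to error terms may differ.

```latex
% Illustrative decomposition: \hat{\lambda}_i = learned eigenvalues, \lambda_i^* = target eigenvalues.
\mathcal{E} \;\approx\;
\underbrace{\sum_{i \in \mathrm{bulk}} \hat{\lambda}_i^{\,2}}_{\text{overfitting: noise eigenvalues surviving the threshold}}
\;+\;
\underbrace{\sum_{i \in \mathrm{missed\ spikes}} \bigl(\lambda_i^{*}\bigr)^{2}}_{\text{underfitting: signal eigenvalues shrunk to zero}}
\;+\;
\underbrace{\sum_{i \in \mathrm{retained\ spikes}} \bigl(\hat{\lambda}_i - \lambda_i^{*}\bigr)^{2}}_{\text{approximation error: bias of the recovered spikes}}
```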
