Convex Dominance in Deep Learning: A Scaling Law of Loss and Learning Rate
Overview
Overall Novelty Assessment
The paper proposes that deep learning loss landscapes exhibit weak convexity after an initial training phase, enabling sequence-to-sequence loss prediction and the derivation of optimal learning-rate scaling laws. It occupies the 'Convexity and Lipschitz Continuity Perspectives' leaf within the 'Theoretical Foundations and Scaling Laws' branch. Notably, this leaf contains only the original paper itself: no sibling papers share its specific focus on convexity emergence for schedule design. This isolation suggests the convexity-centric framing represents a relatively sparse research direction within the broader taxonomy of forty-five papers across thirty-six topics.
The taxonomy reveals that neighboring theoretical approaches pursue different mathematical frameworks: 'Kernel Regression and Intrinsic-Time Models' analyzes dynamics via kernel methods and stochastic differential equations, while 'Empirical Scaling Laws' fits multi-power functional forms without convexity assumptions. The 'Universality and Compute-Optimal Training' leaf studies normalized loss collapse under compute-optimal regimes. Meanwhile, the 'Loss Landscape Geometry and Training Stability' branch examines curvature and sharpness without assuming convexity. The paper's convexity lens thus diverges from both kernel-theoretic foundations and purely empirical curve-fitting, occupying a distinct methodological niche.
Among the thirty candidates examined, the scaling law contribution (Contribution C) encountered one refutable candidate, while the convex-like behavior characterization (Contribution A) and the convergence bound (Contribution B) were each checked against ten candidates with zero refutations. This limited search scope of thirty papers retrieved by semantic search means the analysis captures top-ranked matches but cannot claim exhaustive coverage. The single refutation for Contribution C suggests some prior work on loss-learning-rate scaling exists, whereas the convexity-based prediction framework (Contribution A) appears less directly addressed in the examined literature. The convergence bound's zero refutations may reflect its specific technical formulation rather than absolute novelty.
Given the restricted search scale and the paper's unique position as the sole occupant of its taxonomy leaf, the work appears to introduce a relatively underexplored theoretical perspective. However, the presence of one refutable candidate for the scaling law component indicates partial overlap with existing empirical scaling studies. The analysis reflects top-thirty semantic matches and does not constitute a comprehensive field survey, leaving open the possibility of additional relevant work outside this candidate set.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors establish that deep learning exhibits convex-like optimization dynamics across various architectures, optimizers, and learning rate schedules. They provide a non-asymptotic mapping that predicts loss sequences from learning rate sequences using an upper bound formulation generalized from convex analysis.
The authors prove that deep learning achieves an optimal convergence rate of O(1/√T) when the peak learning rate is scaled as 1/√T and the learning rate schedule satisfies a qualifying condition. They introduce a training-free qualifying exam that determines which schedules achieve this rate.
The authors develop a two-dimensional scaling law that simultaneously predicts optimal loss and learning rate across different training horizons and model sizes. Their data-driven approach extrapolates up to 80× across training horizons and 70× across model sizes.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Convex-like behavior characterization in deep learning via sequence-to-sequence loss prediction
The authors establish that deep learning exhibits convex-like optimization dynamics across various architectures, optimizers, and learning rate schedules. They provide a non-asymptotic mapping that predicts loss sequences from learning rate sequences using an upper bound formulation generalized from convex analysis.
[55] Input Convex Neural Networks
[56] Differentiable convex optimization layers
[57] Gradient Descent Finds Global Minima of Deep Neural Networks
[58] The Break-Even Point on Optimization Trajectories of Deep Neural Networks
[59] Convex optimization for machine learning
[60] Deep Learning without Poor Local Minima
[61] Optimization landscape of neural networks
[62] Neural networks are convex regularizers: Exact polynomial-time convex optimization formulations for two-layer networks
[63] Convergence guarantees for gradient descent in deep neural networks with non-convex loss functions
[64] Deep learning generalization and the convex hull of training sets
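The sequence-to-sequence loss prediction of Contribution A can be illustrated with a minimal sketch. This is not the paper's actual bound: it assumes a Polyak-Lojasiewicz-style surrogate in which the squared gradient norm is proxied by the current loss gap, and the constants `mu`, `smooth`, and `loss_min` are hypothetical values that would in practice be fitted from a short pilot run. The point it demonstrates is that, under a convexity-style descent bound, the loss sequence becomes a deterministic function of the learning-rate sequence.

```python
def predict_losses(etas, loss0=4.0, loss_min=1.5, mu=0.8, smooth=2.0):
    """Hedged sketch of a sequence-to-sequence loss predictor.

    Assumes a PL-style surrogate: ||grad L_t||^2 is proxied by
    2*mu*(L_t - loss_min), and the descent lemma for a smooth function
    gives the per-step upper bound on the loss decrease below.
    All constants are illustrative, not fitted values from the paper.
    """
    losses = [loss0]
    for eta in etas:
        g2 = 2.0 * mu * (losses[-1] - loss_min)      # proxy for squared grad norm
        drop = eta * (1.0 - 0.5 * smooth * eta) * g2  # descent-lemma decrease
        losses.append(max(losses[-1] - drop, loss_min))
    return losses

# A constant-then-decayed schedule maps to a full predicted loss curve.
schedule = [0.1] * 50 + [0.01] * 50
predicted = predict_losses(schedule)
```

Feeding different schedules into the same fitted surrogate yields different predicted loss curves, which is the sense in which the mapping is "sequence-to-sequence" rather than a single scalar scaling law.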
Asymptotic O(1/√T) convergence bound for qualified learning rate schedules
The authors prove that deep learning achieves an optimal convergence rate of O(1/√T) when the peak learning rate is scaled as 1/√T and the learning rate schedule satisfies a qualifying condition. They introduce a training-free qualifying exam that determines which schedules achieve this rate.
[9] The large learning rate phase of deep learning: the catapult mechanism
[46] On the variance of the adaptive learning rate and beyond
[47] Learning-rate-free learning by d-adaptation
[48] Increased rates of convergence through learning rate adaptation
[49] Adaptive stochastic conjugate gradient optimization for backpropagation neural networks
[50] A comparative analysis of gradient descent-based optimization algorithms on convolutional neural networks
[51] Super-convergence: Very fast training of neural networks using large learning rates
[52] AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks
[53] A stochastic gradient method with variance control and variable learning rate for Deep Learning
[54] AMC: Adaptive Learning Rate Adjustment Based on Model Complexity
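The "training-free qualifying exam" of Contribution B can be sketched as a check that runs on the schedule alone, before any training. The specific condition below is an illustrative assumption, not the paper's exact criterion: with the peak learning rate scaled as 1/√T, a schedule passes if its average value stays above a constant fraction of the peak, so that the total accumulated step size grows like √T, which is what the classical O(1/√T) rate requires.

```python
import math

def qualifies(schedule_fn, T, frac=0.25):
    """Hypothetical 'training-free qualifying exam' sketch.

    schedule_fn maps normalized time s in [0, 1) to a multiplier of the
    peak learning rate. The peak is scaled as 1/sqrt(T); the (assumed)
    qualifying condition is that the schedule's mean value is at least
    `frac` of the peak, so sum of step sizes grows like sqrt(T).
    """
    peak = 1.0 / math.sqrt(T)
    etas = [peak * schedule_fn(t / T) for t in range(T)]
    return sum(etas) / T >= frac * peak

# Common decay schedules, expressed as multipliers of the peak.
cosine = lambda s: 0.5 * (1.0 + math.cos(math.pi * s))  # mean ~0.5
linear = lambda s: 1.0 - s                               # mean ~0.5
```

Under this toy criterion, cosine and linear decay both qualify (mean multiplier about 0.5), whereas a schedule that collapses to zero almost immediately would fail, because its accumulated step size no longer scales with √T.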
Scaling law for loss and learning rate across training horizons and model sizes
The authors develop a two-dimensional scaling law that simultaneously predicts optimal loss and learning rate across different training horizons and model sizes. Their data-driven approach extrapolates up to 80× across training horizons and 70× across model sizes.
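A minimal sketch of the two-dimensional fitting procedure, under the assumption (illustrative, not the paper's exact functional form) that the optimal learning rate follows a joint power law eta*(N, T) = c · N^(-a) · T^(-b) in model size N and training horizon T. Taking logarithms turns this into linear regression, and the fitted coefficients can then be extrapolated to model sizes and horizons outside the fitted grid:

```python
import numpy as np

def fit_lr_scaling(Ns, Ts, etas):
    """Fit log(eta) = log(c) - a*log(N) - b*log(T) by least squares.

    The power-law form is an illustrative assumption; in practice the
    (N, T, eta*) triples would come from small-scale tuning sweeps.
    """
    X = np.column_stack([np.ones(len(Ns)), np.log(Ns), np.log(Ts)])
    coef, *_ = np.linalg.lstsq(X, np.log(etas), rcond=None)
    log_c, neg_a, neg_b = coef
    return np.exp(log_c), -neg_a, -neg_b

def predict_lr(c, a, b, N, T):
    """Extrapolate the fitted law to a new model size / horizon."""
    return c * N ** -a * T ** -b

# Synthetic grid generated from an assumed law, for demonstration only.
Ns = np.array([1e7, 1e7, 1e8, 1e8, 1e9, 1e9])
Ts = np.array([1e4, 1e5, 1e4, 1e5, 1e4, 1e5])
etas = 0.5 * Ns ** -0.3 * Ts ** -0.5
c, a, b = fit_lr_scaling(Ns, Ts, etas)
```

The same log-linear machinery extends to the loss side of the claimed two-dimensional law by fitting a second surface over the same (N, T) grid; the extrapolation factors reported in the paper (80× in horizon, 70× in model size) correspond to evaluating the fitted law far outside the grid used for fitting.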