Convex Dominance in Deep Learning: A Scaling Law of Loss and Learning Rate

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: convex optimization, scaling law, learning rate transfer
Abstract:

Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, and hyperparameters. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning in order to precisely control the loss dynamics via learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and that the loss is predictable by an upper bound on the last iterate, which further informs the scaling of the optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as 80× across training horizons and 70× across model sizes.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes that deep learning loss landscapes exhibit weak convexity after initial training, enabling sequence-to-sequence loss prediction and derivation of optimal learning rate scaling laws. It occupies the 'Convexity and Lipschitz Continuity Perspectives' leaf within the 'Theoretical Foundations and Scaling Laws' branch. Notably, this leaf contains only the original paper itself—no sibling papers share this specific focus on convexity emergence for schedule design. This isolation suggests the convexity-centric framing represents a relatively sparse research direction within the broader taxonomy of forty-five papers across thirty-six topics.

The taxonomy reveals that neighboring theoretical approaches pursue different mathematical frameworks: 'Kernel Regression and Intrinsic-Time Models' analyzes dynamics via kernel methods and stochastic differential equations, while 'Empirical Scaling Laws' fits multi-power functional forms without convexity assumptions. The 'Universality and Compute-Optimal Training' leaf studies normalized loss collapse under compute-optimal regimes. Meanwhile, the 'Loss Landscape Geometry and Training Stability' branch examines curvature and sharpness without assuming convexity. The paper's convexity lens thus diverges from both kernel-theoretic foundations and purely empirical curve-fitting, occupying a distinct methodological niche.

Among the thirty candidates examined (ten per contribution), the scaling law contribution (Contribution C) encountered one refutable candidate, while the convex-like behavior characterization (Contribution A) and the convergence bound (Contribution B) each saw zero refutations. This limited search scope (thirty papers from semantic retrieval) means the analysis captures top-ranked matches but cannot claim exhaustive coverage. The single refutation for Contribution C suggests some prior work on loss-learning-rate scaling exists, whereas the convexity-based prediction framework (Contribution A) appears less directly addressed in the examined literature. The convergence bound's zero refutations may reflect its specific technical formulation rather than absolute novelty.

Given the restricted search scale and the paper's unique position as the sole occupant of its taxonomy leaf, the work appears to introduce a relatively underexplored theoretical perspective. However, the presence of one refutable candidate for the scaling law component indicates partial overlap with existing empirical scaling studies. The analysis reflects top-thirty semantic matches and does not constitute a comprehensive field survey, leaving open the possibility of additional relevant work outside this candidate set.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: predicting loss dynamics via learning rate schedules in deep learning. The field encompasses a broad spectrum of approaches, from theoretical foundations that explore scaling laws and convexity properties to practical methods for adaptive scheduling and application-specific tuning. At the highest level, the taxonomy reveals several major branches:

- Theoretical Foundations and Scaling Laws: examines how loss evolves under different mathematical assumptions, often leveraging tools like Lipschitz continuity and convex analysis.
- Loss Landscape Geometry and Training Stability: investigates the shape of the optimization surface and its impact on convergence.
- Adaptive and Automated Learning Rate Scheduling: develops data-driven or feedback-based mechanisms to adjust rates on the fly.
- Parametric and Heuristic Schedules: catalogs classical decay patterns and their variants.
- Hyperparameter Transfer and Scaling: addresses how schedules generalize across model sizes and datasets.
- Large-Batch and Distributed Training: focuses on challenges unique to parallel computation.
- Continual and Transfer Learning Dynamics: studies how schedules must adapt when tasks or data distributions shift.
- Application-Specific Scheduling Methods: tailors strategies to domains like vision or language.
- Empirical Analysis and Heuristics: distills practical insights from large-scale experiments.

Representative works such as Functional Scaling Laws [1] and Multi-Power Law [2] illustrate the theoretical side, while AdaLRS [7] and Incremental PID Scheduler [22] exemplify adaptive automation. Within this landscape, a particularly active line of inquiry contrasts rigorous theoretical guarantees with flexible, empirical heuristics. On one hand, studies like Convex Dominance [0] and Loss Curvature Perspective [12] seek to ground schedule design in formal properties of the loss surface, aiming to predict dynamics under well-defined smoothness or convexity conditions. On the other hand, adaptive methods such as Adaptive Learning Rate [4] and Volatility-based Scheduler [43] prioritize responsiveness to observed training signals, often sacrificing theoretical clarity for practical robustness.

Convex Dominance [0] sits squarely in the theoretical camp, emphasizing convexity and Lipschitz continuity perspectives to derive principled predictions of loss trajectories. This contrasts with nearby empirical approaches like Closer Look Heuristics [45], which distill rules of thumb from extensive experimentation, and with adaptive frameworks like AdaLRS [7], which automate schedule adjustments without requiring strong geometric assumptions. The central tension remains how to balance mathematical rigor with the flexibility needed for diverse architectures and tasks.

Claimed Contributions

Convex-like behavior characterization in deep learning via sequence-to-sequence loss prediction

The authors establish that deep learning exhibits convex-like optimization dynamics across various architectures, optimizers, and learning rate schedules. They provide a non-asymptotic mapping that predicts loss sequences from learning rate sequences using an upper bound formulation generalized from convex analysis.

10 retrieved papers
Asymptotic O(1/√T) convergence bound for qualified learning rate schedules

The authors prove that deep learning achieves the optimal convergence rate of O(1/√T) when the peak learning rate is scaled by 1/√T and the learning rate schedule satisfies a qualifying condition. They introduce a training-free qualifying exam to determine which schedules achieve this rate.

10 retrieved papers
Scaling law for loss and learning rate across training horizons and model sizes

The authors develop a two-dimensional scaling law that simultaneously predicts optimal loss and learning rate across different training horizons and model sizes. Their data-driven approach extrapolates up to 80× across training horizons and 70× across model sizes.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Convex-like behavior characterization in deep learning via sequence-to-sequence loss prediction

The authors establish that deep learning exhibits convex-like optimization dynamics across various architectures, optimizers, and learning rate schedules. They provide a non-asymptotic mapping that predicts loss sequences from learning rate sequences using an upper bound formulation generalized from convex analysis.
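The general shape of such a sequence-to-sequence mapping can be illustrated with the classical upper bound for (sub)gradient descent on a convex, Lipschitz objective. The paper's actual bound targets the last iterate and is more refined; the sketch below uses the textbook averaged-iterate bound instead, with hypothetical constants D (distance from initialization to a minimizer) and G (Lipschitz constant) that are illustrative assumptions, not quantities from the paper:

```python
def predicted_loss_upper_bound(lrs, D=1.0, G=1.0):
    """Running upper bound on the suboptimality of the averaged iterate
    of (sub)gradient descent on a convex, G-Lipschitz objective:

        f(avg_t) - f* <= (D^2 + G^2 * sum(eta_s^2)) / (2 * sum(eta_s)),

    where D bounds the distance from the initialization to a minimizer.
    The function maps a learning rate sequence to a predicted sequence
    of loss bounds, one value per training step.
    """
    bounds = []
    sum_eta, sum_eta_sq = 0.0, 0.0
    for eta in lrs:
        sum_eta += eta
        sum_eta_sq += eta * eta
        bounds.append((D * D + G * G * sum_eta_sq) / (2.0 * sum_eta))
    return bounds

# Example: a constant schedule of 0.1 over 100 steps with D = G = 1.
bounds = predicted_loss_upper_bound([0.1] * 100)
```

With these constants the final bound evaluates to (1 + 1)/(2 · 10) = 0.1, and the bound sequence shrinks as the accumulated step sum grows, mirroring a decreasing loss curve.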

Contribution

Asymptotic O(1/√T) convergence bound for qualified learning rate schedules

The authors prove that deep learning achieves the optimal convergence rate of O(1/√T) when the peak learning rate is scaled by 1/√T and the learning rate schedule satisfies a qualifying condition. They introduce a training-free qualifying exam to determine which schedules achieve this rate.
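The 1/√T scaling of the peak learning rate can be checked numerically in the simplest convex setting. The sketch below uses the classical averaged-iterate bound for a constant schedule (not the paper's last-iterate bound or its qualifying condition, and with illustrative constants D = G = 1) and verifies that the bound times √T stays constant, i.e. the rate is O(1/√T):

```python
import math

def average_iterate_bound(T, peak_lr, D=1.0, G=1.0):
    """Classical convex/Lipschitz bound after T steps of a constant
    schedule eta: (D^2 + G^2 * T * eta^2) / (2 * T * eta)."""
    return (D * D + G * G * T * peak_lr ** 2) / (2.0 * T * peak_lr)

# Scale the peak learning rate as 1/sqrt(T) and normalize the bound by
# sqrt(T); if the normalized values coincide, the rate is O(1/sqrt(T)).
normalized = [
    math.sqrt(T) * average_iterate_bound(T, 1.0 / math.sqrt(T))
    for T in (100, 400, 1600, 6400)
]
```

Plugging eta = D/(G√T) into the bound gives exactly DG/√T, so every normalized value equals 1.0 here; other peak-rate scalings would make the normalized sequence drift with T.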

Contribution

Scaling law for loss and learning rate across training horizons and model sizes

The authors develop a two-dimensional scaling law that simultaneously predicts optimal loss and learning rate across different training horizons and model sizes. Their data-driven approach extrapolates up to 80× across training horizons and 70× across model sizes.
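A two-dimensional fit of this kind can be sketched as least-squares regression in log space. The product power-law form L = c · N^(-a) · T^(-b), the coefficient names, and the synthetic (model size, horizon) grid below are all illustrative assumptions rather than the paper's actual parameterization:

```python
import numpy as np

# Synthetic grid of model sizes N and training horizons T with a known
# power-law loss surface (noiseless, purely for illustration).
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
T = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
NN, TT = np.meshgrid(N, T)
true_c, true_a, true_b = 5.0, 0.3, 0.2
L = true_c * NN ** -true_a * TT ** -true_b

# Linear least squares in log space: log L = log c - a log N - b log T.
X = np.column_stack([np.ones(L.size), -np.log(NN).ravel(), -np.log(TT).ravel()])
y = np.log(L).ravel()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
c_hat, a_hat, b_hat = np.exp(coef[0]), coef[1], coef[2]

# Extrapolate far beyond the fitted grid, e.g. a 70x larger model
# trained for an 80x longer horizon.
L_pred = c_hat * (70 * N[-1]) ** -a_hat * (80 * T[-1]) ** -b_hat
```

On this noiseless grid the fit recovers the generating exponents exactly, so the extrapolated prediction matches the true surface; with real training curves, fit quality at such extrapolation ranges is exactly what the paper's experiments have to establish.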