Convex Dominance in Deep Learning: A Scaling Law of Loss and Learning Rate

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: convex optimization, scaling law, learning rate transfer
Abstract:

Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, and hyperparameters. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning in order to precisely control the loss dynamics via learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and that the loss is predictable by an upper bound on the last iterate, which further informs the scaling of the optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as 80× across training horizons and 70× across model sizes.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes that deep learning loss landscapes exhibit weak convexity after initial training, enabling sequence-to-sequence loss prediction and derivation of optimal learning rate scaling laws. It occupies the 'Convexity and Lipschitz Continuity Perspectives' leaf within the 'Theoretical Foundations and Scaling Laws' branch. Notably, this leaf contains only the original paper itself—no sibling papers share this specific focus on convexity emergence for schedule design. This isolation suggests the convexity-centric framing represents a relatively sparse research direction within the broader taxonomy of forty-five papers across thirty-six topics.

The taxonomy reveals that neighboring theoretical approaches pursue different mathematical frameworks: 'Kernel Regression and Intrinsic-Time Models' analyzes dynamics via kernel methods and stochastic differential equations, while 'Empirical Scaling Laws' fits multi-power functional forms without convexity assumptions. The 'Universality and Compute-Optimal Training' leaf studies normalized loss collapse under compute-optimal regimes. Meanwhile, the 'Loss Landscape Geometry and Training Stability' branch examines curvature and sharpness without assuming convexity. The paper's convexity lens thus diverges from both kernel-theoretic foundations and purely empirical curve-fitting, occupying a distinct methodological niche.

Among the thirty candidates examined (ten per contribution), the scaling law contribution (Contribution C) encountered one refutable candidate, while the convex-like behavior characterization (Contribution A) and the convergence bound (Contribution B) each saw zero refutations. This limited search scope (thirty papers from semantic retrieval) means the analysis captures top-ranked matches but cannot claim exhaustive coverage. The single refutation for Contribution C suggests some prior work on loss-learning-rate scaling exists, whereas the convexity-based prediction framework (Contribution A) appears less directly addressed in the examined literature. The convergence bound's zero refutations may reflect its specific technical formulation rather than absolute novelty.

Given the restricted search scale and the paper's unique position as the sole occupant of its taxonomy leaf, the work appears to introduce a relatively underexplored theoretical perspective. However, the presence of one refutable candidate for the scaling law component indicates partial overlap with existing empirical scaling studies. The analysis reflects top-thirty semantic matches and does not constitute a comprehensive field survey, leaving open the possibility of additional relevant work outside this candidate set.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: predicting loss dynamics via learning rate schedules in deep learning. The field encompasses a broad spectrum of approaches, from theoretical foundations that explore scaling laws and convexity properties to practical methods for adaptive scheduling and application-specific tuning. At the highest level, the taxonomy reveals several major branches:

- Theoretical Foundations and Scaling Laws: examines how loss evolves under different mathematical assumptions, often leveraging tools like Lipschitz continuity and convex analysis.
- Loss Landscape Geometry and Training Stability: investigates the shape of the optimization surface and its impact on convergence.
- Adaptive and Automated Learning Rate Scheduling: develops data-driven or feedback-based mechanisms to adjust rates on the fly.
- Parametric and Heuristic Schedules: catalogs classical decay patterns and their variants.
- Hyperparameter Transfer and Scaling: addresses how schedules generalize across model sizes and datasets.
- Large-Batch and Distributed Training: focuses on challenges unique to parallel computation.
- Continual and Transfer Learning Dynamics: studies how schedules must adapt when tasks or data distributions shift.
- Application-Specific Scheduling Methods: tailors strategies to domains like vision or language.
- Empirical Analysis and Heuristics: distills practical insights from large-scale experiments.

Representative works such as Functional Scaling Laws [1] and Multi-Power Law [2] illustrate the theoretical side, while AdaLRS [7] and Incremental PID Scheduler [22] exemplify adaptive automation. Within this landscape, a particularly active line of inquiry contrasts rigorous theoretical guarantees with flexible, empirical heuristics. On one hand, studies like Convex Dominance [0] and Loss Curvature Perspective [12] seek to ground schedule design in formal properties of the loss surface, aiming to predict dynamics under well-defined smoothness or convexity conditions. On the other hand, adaptive methods such as Adaptive Learning Rate [4] and Volatility-based Scheduler [43] prioritize responsiveness to observed training signals, often sacrificing theoretical clarity for practical robustness.

Convex Dominance [0] sits squarely in the theoretical camp, emphasizing convexity and Lipschitz continuity perspectives to derive principled predictions of loss trajectories. This contrasts with nearby empirical approaches like Closer Look Heuristics [45], which distill rules of thumb from extensive experimentation, and with adaptive frameworks like AdaLRS [7], which automate schedule adjustments without requiring strong geometric assumptions. The central tension remains how to balance mathematical rigor with the flexibility needed for diverse architectures and tasks.

Claimed Contributions

Convex-like behavior characterization in deep learning via sequence-to-sequence loss prediction

The authors establish that deep learning exhibits convex-like optimization dynamics across various architectures, optimizers, and learning rate schedules. They provide a non-asymptotic mapping that predicts loss sequences from learning rate sequences using an upper bound formulation generalized from convex analysis.

10 retrieved papers
Asymptotic O(1/√T) convergence bound for qualified learning rate schedules

The authors prove that deep learning achieves the optimal convergence rate of O(1/√T) when the peak learning rate is scaled by 1/√T and the learning rate schedule satisfies a qualifying condition. They introduce a training-free qualifying exam to determine which schedules achieve this rate.

10 retrieved papers
Scaling law for loss and learning rate across training horizons and model sizes

The authors develop a two-dimensional scaling law that simultaneously predicts optimal loss and learning rate across different training horizons and model sizes. Their data-driven approach extrapolates up to 80× across training horizons and 70× across model sizes.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Convex-like behavior characterization in deep learning via sequence-to-sequence loss prediction

The authors establish that deep learning exhibits convex-like optimization dynamics across various architectures, optimizers, and learning rate schedules. They provide a non-asymptotic mapping that predicts loss sequences from learning rate sequences using an upper bound formulation generalized from convex analysis.
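The general shape of such a sequence-to-sequence mapping can be illustrated with the classical upper bound for (sub)gradient descent on a convex, Lipschitz objective. The paper's actual bound targets the last iterate and is more refined; the sketch below uses the textbook averaged-iterate bound instead, with hypothetical constants D (distance from initialization to a minimizer) and G (Lipschitz constant) that are illustrative assumptions, not quantities from the paper:

```python
def predicted_loss_upper_bound(lrs, D=1.0, G=1.0):
    """Running upper bound on the suboptimality of the averaged iterate
    of (sub)gradient descent on a convex, G-Lipschitz objective:

        f(avg_t) - f* <= (D^2 + G^2 * sum(eta_s^2)) / (2 * sum(eta_s)),

    where D bounds the distance from the initialization to a minimizer.
    The function maps a learning rate sequence to a predicted sequence
    of loss bounds, one value per training step.
    """
    bounds = []
    sum_eta, sum_eta_sq = 0.0, 0.0
    for eta in lrs:
        sum_eta += eta
        sum_eta_sq += eta * eta
        bounds.append((D * D + G * G * sum_eta_sq) / (2.0 * sum_eta))
    return bounds

# Example: a constant schedule of 0.1 over 100 steps with D = G = 1.
bounds = predicted_loss_upper_bound([0.1] * 100)
```

With these constants the final bound evaluates to (1 + 1)/(2 · 10) = 0.1, and the bound sequence shrinks as the accumulated step sum grows, mirroring a decreasing loss curve.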

Contribution

Asymptotic O(1/√T) convergence bound for qualified learning rate schedules

The authors prove that deep learning achieves the optimal convergence rate of O(1/√T) when the peak learning rate is scaled by 1/√T and the learning rate schedule satisfies a qualifying condition. They introduce a training-free qualifying exam to determine which schedules achieve this rate.
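The 1/√T scaling of the peak learning rate can be checked numerically in the simplest convex setting. The sketch below uses the classical averaged-iterate bound for a constant schedule (not the paper's last-iterate bound or its qualifying condition, and with illustrative constants D = G = 1) and verifies that the bound times √T stays constant, i.e. the rate is O(1/√T):

```python
import math

def average_iterate_bound(T, peak_lr, D=1.0, G=1.0):
    """Classical convex/Lipschitz bound after T steps of a constant
    schedule eta: (D^2 + G^2 * T * eta^2) / (2 * T * eta)."""
    return (D * D + G * G * T * peak_lr ** 2) / (2.0 * T * peak_lr)

# Scale the peak learning rate as 1/sqrt(T) and normalize the bound by
# sqrt(T); if the normalized values coincide, the rate is O(1/sqrt(T)).
normalized = [
    math.sqrt(T) * average_iterate_bound(T, 1.0 / math.sqrt(T))
    for T in (100, 400, 1600, 6400)
]
```

Plugging eta = D/(G√T) into the bound gives exactly DG/√T, so every normalized value equals 1.0 here; other peak-rate scalings would make the normalized sequence drift with T.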

Contribution

Scaling law for loss and learning rate across training horizons and model sizes

The authors develop a two-dimensional scaling law that simultaneously predicts optimal loss and learning rate across different training horizons and model sizes. Their data-driven approach extrapolates up to 80× across training horizons and 70× across model sizes.
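A two-dimensional fit of this kind can be sketched as least-squares regression in log space. The product power-law form L = c · N^(-a) · T^(-b), the coefficient names, and the synthetic (model size, horizon) grid below are all illustrative assumptions rather than the paper's actual parameterization:

```python
import numpy as np

# Synthetic grid of model sizes N and training horizons T with a known
# power-law loss surface (noiseless, purely for illustration).
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
T = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
NN, TT = np.meshgrid(N, T)
true_c, true_a, true_b = 5.0, 0.3, 0.2
L = true_c * NN ** -true_a * TT ** -true_b

# Linear least squares in log space: log L = log c - a log N - b log T.
X = np.column_stack([np.ones(L.size), -np.log(NN).ravel(), -np.log(TT).ravel()])
y = np.log(L).ravel()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
c_hat, a_hat, b_hat = np.exp(coef[0]), coef[1], coef[2]

# Extrapolate far beyond the fitted grid, e.g. a 70x larger model
# trained for an 80x longer horizon.
L_pred = c_hat * (70 * N[-1]) ** -a_hat * (80 * T[-1]) ** -b_hat
```

On this noiseless grid the fit recovers the generating exponents exactly, so the extrapolated prediction matches the true surface; with real training curves, fit quality at such extrapolation ranges is exactly what the paper's experiments have to establish.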