Saddle-To-Saddle Dynamics in Deep ReLU Networks: Low-Rank Bias in the First Saddle Escape

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Saddle-to-saddle, Implicit bias, Low-rank bias, Bottleneck rank
Abstract:

When a deep ReLU network is initialized with small weights, gradient descent (GD) is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions along which GD leaves the origin, which play a role similar to that of the Hessian eigenvectors at a strict saddle. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the ℓ-th layer weight matrix is larger than any other singular value by a factor of at least ℓ^(1/4). We also prove a number of related results about these escape directions. We suggest that deep ReLU networks exhibit saddle-to-saddle dynamics, with GD visiting a sequence of saddles of increasing bottleneck rank.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and claimed contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper characterizes optimal escape directions from the origin saddle in deep ReLU networks initialized with small weights, proving a low-rank bias in which the first singular value of layer ℓ exceeds the others by a factor of at least ℓ^(1/4). It proposes a saddle-to-saddle dynamics framework in which gradient descent visits a sequence of saddles of increasing bottleneck rank. Within the taxonomy, the paper resides in 'Deep Network Saddle Point Dynamics' alongside one sibling paper, making this a relatively sparse research direction: only two of the fourteen papers across the taxonomy fall in this leaf.

The taxonomy reveals that most theoretical work concentrates on shallow networks (three papers in 'Shallow Network Convergence and Implicit Bias') or over-parameterized convergence guarantees (two papers in 'Over-Parameterized Regime Convergence'). The paper's focus on deep network saddle structure diverges from these neighboring areas, which either restrict to one or two layers or assume sufficient over-parameterization to avoid saddle complications. The 'Population Gradient Analysis' leaf contains one paper examining critical point structure more generally, but without the depth-dependent rank evolution emphasis central to this work.

Among the thirty candidates examined across the three contributions, none was identified as clearly refuting the paper's claims. Ten candidates were retrieved for each of the three contributions (the low-rank bias characterization, the saddle-to-saddle dynamics framework, and the escape direction characterization), and none was refutable. This suggests that, within the limited search scope, the specific combination of depth-dependent singular value separation (the ℓ^(1/4) scaling) and the sequential saddle visitation framework appears novel. However, the small candidate pool and sparse taxonomy leaf indicate this may reflect an under-explored research direction rather than exhaustive validation.

The analysis covers top-thirty semantic matches and reveals no substantial prior work overlap within this scope. The sparse taxonomy structure and absence of refutable candidates suggest the paper addresses questions not extensively tackled in existing literature, though the limited search scale means potentially relevant work outside this candidate set remains unexamined. The depth-specific rank evolution and saddle sequence framework appear distinctive within the surveyed literature.

Taxonomy

Core-task taxonomy papers: 14
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: understanding gradient descent dynamics in deep ReLU networks with small initialization.

The field structure reflects a multifaceted investigation into how neural networks learn from the ground up. The taxonomy organizes research into four main branches: theoretical characterization of gradient flow dynamics, which examines the continuous-time evolution of parameters and the geometric properties of trajectories; convergence guarantees and optimization landscape properties, which study the conditions under which training succeeds and the structure of loss surfaces; initialization strategies and early training behavior, which explore how initial parameter choices shape subsequent learning (e.g., Weight Initialization Review[3], Initialization Effect[8]); and empirical analysis of training dynamics, which documents observed phenomena through experiments. These branches are deeply interconnected: theoretical insights about flow dynamics often inform initialization choices, while empirical observations motivate new convergence analyses.

Within the theoretical characterization branch, a particularly active line of work focuses on saddle point dynamics and the interplay between network depth and gradient behavior. Saddle Escape Dynamics[0] sits squarely in this area, examining how gradient descent navigates critical points in deep networks initialized at small scales. This contrasts with works such as Shallow ReLU Dynamics[2], which restricts attention to simpler architectures, and complements studies such as Vanishing Curvature[11], which investigates how curvature properties evolve during training. A recurring theme across these studies is the tension between small initialization, which can lead to slow early progress or stagnation near saddle points, and the need for sufficient parameter movement to escape unfavorable regions.

The original paper's emphasis on saddle escape mechanisms in deep settings addresses open questions about how depth amplifies or mitigates these challenges, positioning it among theoretical efforts to explain early training phases beyond standard lazy-regime analyses.

Claimed Contributions

Low-rank bias characterization in optimal escape directions of deep ReLU networks

The authors prove that in deep ReLU networks initialized with small weights, the optimal escape direction from the saddle at the origin exhibits a low-rank bias that strengthens in deeper layers, with the second singular value smaller than the first by a factor of O(ℓ^(-1/4)) for layers beyond a certain depth.

10 retrieved papers
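The singular-value separation claimed above can be made concrete with a short numeric check. The helper below is purely illustrative (the function name and the example matrix are our own, not the paper's): it computes the ratio of the second to the first singular value of a layer's weight matrix and, alongside it, the ℓ^(-1/4) scale that the claimed bias predicts should control that ratio for sufficiently deep layers.

```python
import numpy as np

def singular_value_gap(weight, layer_index):
    """Return (sigma_2 / sigma_1, layer_index ** -0.25) for a layer's
    weight matrix; the claimed low-rank bias predicts the first quantity
    is O(the second) in deep enough layers."""
    s = np.linalg.svd(weight, compute_uv=False)  # singular values, descending
    return s[1] / s[0], layer_index ** -0.25

# Example: a rank-biased 3x3 matrix with singular values (4, 1, 0),
# checked at hypothetical layer index 16.
ratio, scale = singular_value_gap(np.diag([4.0, 1.0, 0.0]), layer_index=16)
# ratio == 0.25, scale == 16 ** -0.25 == 0.5
```

A diagnostic like this could be run per layer during training to see whether the σ₂/σ₁ ratio decays with depth at the predicted rate.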
Saddle-to-saddle dynamics framework for deep ReLU networks

The authors propose that gradient descent in deep ReLU networks follows saddle-to-saddle dynamics where the bottleneck rank gradually increases across successive saddles, analogous to rank incremental learning in linear networks but using bottleneck rank as the appropriate notion of sparsity.

10 retrieved papers
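For intuition on the linear-network analogy this contribution invokes: when a network is a product of linear layers, the rank of the end-to-end map is bounded by the smallest layer rank. The sketch below illustrates only that simplified linear bound, not the paper's bottleneck-rank notion for ReLU networks, and the helper name is our own.

```python
import numpy as np

def end_to_end_rank_bound(weights, tol=1e-8):
    # Upper bound on the rank of the composed linear map W_L ... W_1,
    # using rank(A @ B) <= min(rank(A), rank(B)) across all layers.
    return min(np.linalg.matrix_rank(W, tol=tol) for W in weights)

# Three layers with ranks 3, 1, and 2: the composition has rank at most 1.
layers = [np.eye(3), np.diag([1.0, 0.0, 0.0]), np.diag([1.0, 2.0, 0.0])]
# end_to_end_rank_bound(layers) == 1
```

In the saddle-to-saddle picture described above, this is the quantity that would increase by steps as gradient descent moves from one saddle to the next.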
Characterization of escape directions and speeds at the origin saddle

The authors provide a theoretical description of the saddle at the origin in deep ReLU networks and characterize the escape directions that gradient descent takes when leaving this first saddle, including proving that the optimal escape speed is non-decreasing in network depth.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Low-rank bias characterization in optimal escape directions of deep ReLU networks

The authors prove that in deep ReLU networks initialized with small weights, the optimal escape direction from the saddle at the origin exhibits a low-rank bias that strengthens in deeper layers, with the second singular value smaller than the first by a factor of O(ℓ^(-1/4)) for layers beyond a certain depth.

Contribution

Saddle-to-saddle dynamics framework for deep ReLU networks

The authors propose that gradient descent in deep ReLU networks follows saddle-to-saddle dynamics where the bottleneck rank gradually increases across successive saddles, analogous to rank incremental learning in linear networks but using bottleneck rank as the appropriate notion of sparsity.

Contribution

Characterization of escape directions and speeds at the origin saddle

The authors provide a theoretical description of the saddle at the origin in deep ReLU networks and characterize the escape directions that gradient descent takes when leaving this first saddle, including proving that the optimal escape speed is non-decreasing in network depth.