Directional Convergence and Benign Overfitting of Gradient Descent in Leaky ReLU Two-Layer Neural Networks

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: benign overfitting, implicit bias, neural networks, classification
Abstract:

In this paper, we provide sufficient conditions for benign overfitting in fixed-width leaky ReLU two-layer neural network classifiers trained on mixture data via gradient descent. Our results are derived by establishing directional convergence of the network parameters and a classification error bound for the convergent direction. The error bound also reveals a previously unidentified phase transition. Directional convergence in (leaky) ReLU neural networks was previously established only for gradient flow; lacking such a result, earlier work on benign overfitting was limited to networks trained on nearly orthogonal data. All of our results hold for mixture data, a broader setting than the nearly orthogonal data considered in prior work. We demonstrate our findings by showing that benign overfitting occurs with high probability in a much wider range of scenarios than previously known. Our results also allow us to characterize cases in which benign overfitting provably fails even when directional convergence occurs. Our work thus provides a more complete picture of benign overfitting in leaky ReLU two-layer neural networks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes directional convergence for gradient descent (not just gradient flow) in leaky ReLU two-layer networks and derives classification error bounds revealing a phase transition in benign overfitting. It sits in the 'Gradient Descent Training Dynamics' leaf under 'ReLU and Leaky ReLU Networks', which contains four papers total. This is a moderately populated research direction within the broader taxonomy of 34 papers across the field, indicating focused but not overcrowded attention to gradient descent dynamics in ReLU-type networks.

The taxonomy shows this leaf is one of three under 'ReLU and Leaky ReLU Networks', with sibling leaves examining 'Hinge Loss and Margin Maximization' (three papers) and 'Logistic Loss and Classification' (two papers). Neighboring branches explore 'Convolutional Neural Networks' (four papers) and 'Linear and Smooth Activation Networks' (three papers). The leaf's scope note clarifies that it focuses specifically on directional convergence under gradient descent/flow, excluding alternative loss functions. The paper's extension from gradient flow to gradient descent represents a technical advance within this established research direction.

Among the 26 candidates examined, the contribution on directional convergence for gradient descent (10 candidates, 0 refutable) and the phase-transition discovery (10 candidates, 0 refutable) appear novel within the limited search scope. However, the claim of extending results to 'broader data settings' (6 candidates examined, 2 refutable) shows more substantial overlap with prior work. These statistics suggest that the first two contributions face less direct competition among the examined candidates, while the data-generality claim encounters existing work addressing similar mixture or non-orthogonal data scenarios.

Based on the top-26 semantic matches examined, the technical contributions on gradient descent convergence and phase transitions appear relatively novel, while the data setting extension shows clearer overlap with prior work. The analysis covers a focused subset of the literature; a broader search might reveal additional related work, particularly in the 'Data Characteristics and Noise Models' branch (nine papers) which was not the primary focus of this candidate examination.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 2

Research Landscape Overview

Core task: benign overfitting in two-layer neural networks. This field investigates the phenomenon where overparameterized shallow networks interpolate noisy training data yet still generalize well on test data, defying classical statistical intuition.

The taxonomy organizes research into several main branches: Activation Function and Architecture Variants explore how different nonlinearities (ReLU, leaky ReLU) and architectural choices influence benign overfitting; Data Characteristics and Noise Models examine the role of label noise, feature structure, and sample complexity; Adversarial Robustness and Security study whether benign overfitting persists under adversarial perturbations; Generalization Theory and Implicit Regularization analyze the implicit biases of gradient-based training that enable good generalization despite interpolation; and Extended Architectures and Generalizations broaden the scope to convolutional networks, transformers, and deeper models.

Representative works such as Benign Overfitting ReLU[3] and Benign Overfitting Leaky ReLU[8] illustrate how activation choices shape the training dynamics, while studies like Benign Overfitting Adversarial[2] and Benign Overfitting Noisy Features[26] highlight the interplay between data properties and overfitting behavior. A particularly active line of work focuses on gradient descent dynamics with ReLU-type activations, examining how directional convergence and implicit bias lead networks toward max-margin solutions that generalize despite a perfect training fit. Directional Convergence Leaky ReLU[0] sits squarely within this branch, analyzing how leaky ReLU networks trained by gradient descent exhibit directional convergence properties that facilitate benign overfitting.
This work closely relates to Benign Overfitting ReLU[3], which establishes foundational results for standard ReLU networks, and contrasts with Benign Overfitting Regression[5], which explores similar phenomena in simpler regression settings without the complexities of nonlinear activations. Meanwhile, other branches investigate whether benign overfitting extends to adversarially robust training or whether it breaks down under distribution shift, revealing trade-offs between interpolation, generalization, and robustness. Open questions remain about the precise conditions under which benign overfitting occurs, the role of initialization and architecture depth, and how these insights scale to practical deep learning scenarios.

Claimed Contributions

Directional convergence of gradient descent in leaky ReLU two-layer neural networks

The authors establish directional convergence of gradient descent for leaky ReLU two-layer neural networks trained on mixture data with exponential loss, providing precise characterization of the convergent direction. This is the first such result for ReLU-type networks under gradient descent, extending beyond prior work limited to gradient flow or nearly orthogonal data.

10 retrieved papers
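The claimed setting can be illustrated numerically. The sketch below is illustrative only: the width, data scales, learning rate, and the common simplification of a fixed second layer are our assumptions, not the paper's setup. It trains the inner layer of a two-layer leaky ReLU network by gradient descent on the exponential loss over synthetic mixture data and tracks the cosine similarity between successive normalized parameter vectors, which approaches 1 as the parameter direction stabilizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic mixture data: x_i = y_i * mu + Gaussian noise (illustrative scales)
n, d, m = 20, 50, 8                # samples, input dimension, hidden width
mu = np.zeros(d)
mu[0] = 3.0
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu + 0.5 * rng.standard_normal((n, d))

alpha = 0.1                                        # leaky ReLU slope
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed outer layer
W = 0.01 * rng.standard_normal((m, d))             # trained inner layer

def forward(W, X):
    Z = X @ W.T                                    # pre-activations, shape (n, m)
    return np.where(Z > 0, Z, alpha * Z) @ a       # network outputs, shape (n,)

lr, dirs = 0.05, []
for t in range(4000):
    Z = X @ W.T
    act_grad = np.where(Z > 0, 1.0, alpha)         # leaky ReLU derivative
    loss_grad = -y * np.exp(-y * forward(W, X))    # exponential loss derivative
    G = (act_grad * loss_grad[:, None] * a).T @ X  # dL/dW, shape (m, d)
    W -= lr * G / n
    if t % 1000 == 999:
        dirs.append(W.ravel() / np.linalg.norm(W))

# Directional convergence: successive normalized parameter vectors align
cosines = [float(u @ v) for u, v in zip(dirs, dirs[1:])]
print(cosines)
```

On such separable mixture data the cosines climb toward 1, consistent with the directional convergence the paper analyzes; the paper's theorems, of course, characterize the limit direction precisely rather than observing it empirically.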
Classification error bounds revealing phase transition in benign overfitting

The authors derive classification error bounds for the convergent direction that reveal a phase transition between weak signal and strong signal regimes. They provide both upper and lower bounds for Gaussian mixtures, showing when benign overfitting occurs or provably fails even with directional convergence.

10 retrieved papers
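For intuition about a weak-signal/strong-signal transition (a textbook baseline, not the paper's bound): for the symmetric Gaussian mixture x = y*mu + N(0, I), the linear classifier sign(<mu, x>) has error Phi(-||mu||), which moves from near 1/2 for weak signal to near 0 for strong signal. The dimensions and signal strengths below are arbitrary choices for illustration.

```python
import numpy as np
from math import erf, sqrt

def gauss_mix_error(signal):
    # For x = y*mu + N(0, I) with ||mu|| = signal, sign(<mu, x>) errs
    # exactly when a standard normal falls below -||mu||, so
    # error = Phi(-||mu||).
    return 0.5 * (1.0 + erf(-signal / sqrt(2.0)))

# Monte Carlo check of the closed form
rng = np.random.default_rng(1)
d, n = 50, 20000
for s in [0.1, 1.0, 3.0]:
    mu = np.zeros(d)
    mu[0] = s
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * mu + rng.standard_normal((n, d))
    emp = float(np.mean(np.sign(X @ mu) != y))
    print(f"signal={s}: closed-form={gauss_mix_error(s):.4f}, empirical={emp:.4f}")
```

The paper's contribution is sharper than this baseline: it bounds the error of the direction that gradient descent actually converges to, and shows both when that error vanishes and when overfitting provably fails to be benign.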
Extension of benign overfitting results to broader data settings

The authors extend benign overfitting results beyond the nearly orthogonal data regime studied in prior work to general mixture data settings, including polynomially tailed distributions. Their deterministic conditions allow proving benign overfitting with high probability under weaker distributional assumptions than previous sub-Gaussian requirements.

6 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Directional convergence of gradient descent in leaky ReLU two-layer neural networks

The authors establish directional convergence of gradient descent for leaky ReLU two-layer neural networks trained on mixture data with exponential loss, providing precise characterization of the convergent direction. This is the first such result for ReLU-type networks under gradient descent, extending beyond prior work limited to gradient flow or nearly orthogonal data.

Contribution

Classification error bounds revealing phase transition in benign overfitting

The authors derive classification error bounds for the convergent direction that reveal a phase transition between weak signal and strong signal regimes. They provide both upper and lower bounds for Gaussian mixtures, showing when benign overfitting occurs or provably fails even with directional convergence.

Contribution

Extension of benign overfitting results to broader data settings

The authors extend benign overfitting results beyond the nearly orthogonal data regime studied in prior work to general mixture data settings, including polynomially tailed distributions. Their deterministic conditions allow proving benign overfitting with high probability under weaker distributional assumptions than previous sub-Gaussian requirements.