Generalization Below the Edge of Stability: The Role of Data Geometry

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: neural networks, deep learning theory, gradient descent, representation learning, generalization
Abstract:

Understanding generalization in overparameterized neural networks hinges on the interplay between data geometry, neural architecture, and training dynamics. This paper presents theoretical results on how data geometry controls this implicit bias for overparameterized two-layer ReLU networks trained below the edge of stability. First, for data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Second, for a family of isotropic distributions that vary in how strongly probability mass concentrates toward the unit sphere, we derive a spectrum of bounds showing that rates deteriorate as the mass concentrates toward the sphere. These results instantiate a unifying principle: when the data is harder to “shatter” with respect to the activation thresholds of the ReLU neurons, gradient descent tends to learn representations that capture shared patterns and thus finds solutions that generalize well. Conversely, for data that is easily shattered (e.g., data supported on the sphere), gradient descent favors memorization. Our theoretical results consolidate disparate empirical findings that have appeared in the literature.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes generalization bounds for two-layer ReLU networks trained below the edge of stability, focusing on how data geometry—specifically intrinsic dimension and concentration properties—controls implicit bias. It resides in the 'Data Geometry and Generalization' leaf under 'Generalization Theory and Implicit Bias', which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of 29 papers across multiple branches, suggesting the intersection of data geometry and edge-of-stability training remains relatively underexplored compared to sharpness-focused or stability-based approaches.

The taxonomy reveals neighboring leaves addressing related but distinct mechanisms: 'Flatness, Sharpness, and Minima Stability' examines loss curvature without explicit data geometry focus, while 'Feature Learning and Implicit Regularization' analyzes representation dynamics beyond kernel regimes. The parent branch 'Generalization Theory and Implicit Bias' excludes stability-based bounds, which are instead covered under 'Algorithmic Stability and Generalization Bounds'. The paper's emphasis on shattering and concentration connects it to geometric perspectives but diverges from purely algorithmic or architectural analyses found in sibling branches like 'Optimization Dynamics and Convergence' or 'Specialized Architectures'.

Among the 14 candidates examined, the intrinsic-dimension-adaptive bounds were compared against 10 candidates, of which 2 appear refutable; the data shatterability principle faced 2 candidates, of which 1 appears refutable; and the Beta-radial distribution spectrum showed no refutations among its 2 candidates. These statistics reflect a limited semantic-search scope, not exhaustive coverage. The intrinsic-dimension result appears to have more substantial prior overlap within the examined set, whereas the concentration-dependent spectrum and the unifying shatterability principle show fewer direct precedents among the candidates retrieved.

Based on the top-14 semantic matches examined, the work appears to occupy a relatively sparse niche connecting data geometry to edge-of-stability generalization. The limited search scope means potentially relevant work outside the top-K retrieval or citation expansion may exist. The taxonomy structure suggests this intersection is less crowded than sharpness-based or stability-focused directions, though the contribution-level statistics indicate varying degrees of novelty across the paper's three main claims.

Taxonomy

Core-task Taxonomy Papers: 29
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 3

Research Landscape Overview

Core task: Generalization of overparameterized neural networks trained below the edge of stability. The field structure reflects a multifaceted investigation into why and how modern deep networks generalize despite operating in regimes once thought unstable. The taxonomy organizes work into six main branches: Edge of Stability Dynamics examines training behavior near critical learning rates where loss oscillates yet networks converge; Generalization Theory and Implicit Bias explores how optimization algorithms favor certain solutions over others, often through geometric or data-dependent mechanisms; Optimization Dynamics studies convergence properties and algorithmic choices; Algorithmic Stability provides formal generalization bounds; Specialized Architectures addresses domain-specific settings; and Theoretical Frameworks develops foundational mathematical tools.

Representative works such as Sharpness-aware Edge Stability[1] and Matrix Factorization Edge[2] illustrate how different problem settings reveal distinct stability phenomena, while SGD Global Minima[3] and Stable Minima Large Steps[5] anchor classical perspectives on convergence.

Recent lines of work highlight contrasting themes around data geometry, implicit regularization, and the interplay between sharpness and generalization. Data Geometry Generalization[0] sits within the Generalization Theory and Implicit Bias branch, specifically focusing on how data structure shapes generalization outcomes below the edge of stability. This emphasis aligns closely with Data Geometry Edge[10], which similarly investigates geometric properties of training data in edge-of-stability regimes. In contrast, works like Feature Learning Edge[13] and Conflicting Biases Edge[23] examine how feature representations evolve under large learning rates, while Dynamical Stability SGD[11] and Stability Without NTK[9] probe algorithmic stability from complementary angles.
The original paper's focus on data geometry positions it among efforts to understand generalization through the lens of input structure rather than purely algorithmic or architectural factors, offering a perspective that complements sharpness-based analyses like SAM versus SGD[25] and geometry-compactness approaches such as Geometry Compactness Stability[24].

Claimed Contributions

Generalization bounds adapting to intrinsic dimension for mixture-of-subspaces data

The authors prove that for data supported on a union of low-dimensional subspaces, gradient descent below the edge of stability achieves generalization rates that scale with the intrinsic dimension m rather than the ambient dimension d, demonstrating provable adaptation to low-dimensional structure.

10 retrieved papers
Can Refute

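To make the claimed data model concrete, the following is a minimal NumPy sketch of sampling from a union of low-dimensional subspaces. All parameter choices (`d`, `m`, `k`) and names are illustrative assumptions, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_union_of_subspaces(n, d=50, m=3, k=4):
    """Sample n points in R^d lying on a union of k random m-dimensional
    linear subspaces (m << d): a toy model of low intrinsic dimension.
    Illustrative stand-in, not the paper's exact construction."""
    # Random orthonormal basis for each subspace via QR decomposition.
    bases = [np.linalg.qr(rng.standard_normal((d, m)))[0] for _ in range(k)]
    labels = rng.integers(0, k, size=n)       # which subspace each point uses
    coeffs = rng.standard_normal((n, m))      # coordinates within that subspace
    X = np.stack([bases[labels[i]] @ coeffs[i] for i in range(n)])
    return X, labels

X, labels = sample_union_of_subspaces(200)
# The points sit in R^50, but those drawn from any single subspace
# span at most m = 3 directions.
rank0 = np.linalg.matrix_rank(X[labels == 0])
print(X.shape, rank0)
```

The rank check makes the gap visible that intrinsic-dimension-adaptive bounds exploit: the ambient dimension is d = 50, but each point carries only m = 3 degrees of freedom.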
Spectrum of generalization bounds for isotropic Beta-radial distributions

The authors establish a family of upper and lower bounds for isotropic distributions parameterized by radial concentration α, showing that generalization degrades as probability mass concentrates toward the boundary, with matching constructions demonstrating tightness.

2 retrieved papers

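One way to picture this distribution family, under the assumption that "Beta-radial" means a uniform direction with a Beta-distributed radius (the paper's exact parameterization may differ), is the following sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_beta_radial(n, d, alpha, beta=1.0):
    """Isotropic samples in the unit ball: uniform direction on the sphere,
    Beta(alpha, beta)-distributed radius. Illustrative assumed form only."""
    dirs = rng.standard_normal((n, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # uniform direction
    radii = rng.beta(alpha, beta, size=n)                # radial law
    return dirs * radii[:, None]

# Larger alpha pushes the radial mass toward r = 1, i.e. toward the unit
# sphere, which is the regime where the reported bounds deteriorate.
means = []
for alpha in (1.0, 5.0, 50.0):
    X = sample_beta_radial(5000, d=10, alpha=alpha)
    means.append(np.linalg.norm(X, axis=1).mean())
print(means)
```

With beta fixed at 1, the mean radius alpha / (alpha + 1) climbs toward 1 as alpha grows, so the concentration parameter interpolates between ball-filling and sphere-concentrated data.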
Data shatterability principle unifying implicit regularization and geometry

The authors introduce the principle of data shatterability as a unifying framework explaining how data geometry controls implicit regularization below the edge of stability, showing that less shatterable data leads to stronger regularization and better generalization.

2 retrieved papers
Can Refute
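A toy proxy for this principle, assuming nothing about the paper's formal definition of shatterability: count how many distinct ReLU on/off patterns random neurons induce on a dataset. Spread-out data on the sphere admits many patterns (easy to shatter), while tightly clustered data admits few:

```python
import numpy as np

rng = np.random.default_rng(0)

def count_activation_patterns(X, n_neurons=2000):
    """Count distinct ReLU on/off patterns random neurons induce on X.
    More distinct patterns = easier to 'shatter'. A toy proxy only,
    not the paper's formal shatterability notion."""
    d = X.shape[1]
    W = rng.standard_normal((n_neurons, d))
    b = rng.standard_normal(n_neurons)
    patterns = (X @ W.T - b > 0)          # shape (n_points, n_neurons)
    return len({tuple(col) for col in patterns.T})

n, d = 12, 5
sphere = rng.standard_normal((n, d))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)  # spread on the sphere
clustered = np.tile(rng.standard_normal((2, d)), (n // 2, 1))
clustered += 0.01 * rng.standard_normal((n, d))          # two tight clusters

p_sphere = count_activation_patterns(sphere)
p_clustered = count_activation_patterns(clustered)
print(p_sphere, p_clustered)
```

Random hyperplanes rarely cut through a tight cluster, so the clustered data realizes far fewer activation patterns, mirroring the claim that less shatterable data invites shared representations rather than memorization.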

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

Generalization bounds adapting to intrinsic dimension for mixture-of-subspaces data

The authors prove that for data supported on a union of low-dimensional subspaces, gradient descent below the edge of stability achieves generalization rates that scale with the intrinsic dimension m rather than the ambient dimension d, demonstrating provable adaptation to low-dimensional structure.

Contribution 2

Spectrum of generalization bounds for isotropic Beta-radial distributions

The authors establish a family of upper and lower bounds for isotropic distributions parameterized by radial concentration α, showing that generalization degrades as probability mass concentrates toward the boundary, with matching constructions demonstrating tightness.

Contribution 3

Data shatterability principle unifying implicit regularization and geometry

The authors introduce the principle of data shatterability as a unifying framework explaining how data geometry controls implicit regularization below the edge of stability, showing that less shatterable data leads to stronger regularization and better generalization.
