Neural Networks Learn Multi-Index Models Near the Information-Theoretic Limit
Overview
Overall Novelty Assessment
The paper establishes near-optimal sample and time complexity for learning Gaussian multi-index models via layer-wise gradient descent, achieving Õ(d) sample and Õ(d²) time complexity that match the information-theoretic limits up to polylogarithmic factors. It resides in the 'Information-Theoretic Optimality and Complexity Bounds' leaf alongside two sibling papers within the 'Theoretical Foundations and Sample Complexity' branch. This leaf represents a focused research direction with only three papers in total, indicating a relatively sparse but foundational area concerned with proving optimality guarantees rather than exploring algorithmic variants or training dynamics.
The taxonomy reveals neighboring research directions that contextualize this work. The sibling 'Computational Hardness and Learnability Limits' leaf (two papers) explores fundamental barriers, while the 'Training Dynamics and Gradient Flow Analysis' branch (seven papers across four leaves) examines temporal evolution and convergence properties. The paper bridges these areas by proving optimal complexity while analyzing gradient descent dynamics, particularly through its power-iteration mechanism. Its position in Theoretical Foundations distinguishes it from algorithmic contributions in the 'Algorithm Design' branch (seven papers) and application-focused work in 'Specialized Models' (eight papers).
Among the twenty-nine candidates examined via limited semantic search, none clearly refutes the three main contributions: the first contribution (near-optimal learning) was checked against ten candidates, the second (power-iteration mechanism) against ten, and the third (necessity of diverging first-layer steps) against nine, with zero refutations in each case. This suggests that, within the examined scope, the specific combination of optimal complexity guarantees, mechanistic analysis via power iteration, and the necessity result for first-layer training appears novel. However, given the limited scale of the search, relevant prior work outside the top-K matches may have been missed.
Based on the examined literature, the work appears to make substantive theoretical contributions at the intersection of optimality theory and training dynamics. The analysis covers top-thirty semantic matches plus citation expansion, providing reasonable confidence within this scope but not exhaustive coverage of the broader deep learning theory literature. The sparse population of the information-theoretic optimality leaf and absence of refutations among examined candidates suggest meaningful novelty, though comprehensive assessment would require broader search.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors prove that a two-layer neural network trained with layer-wise gradient descent achieves information-theoretically optimal sample complexity (up to leading order) and time complexity for learning generic Gaussian multi-index models, matching the Θ(d) sample threshold and Θ(d²) time threshold.
The authors show that during the first stage of gradient descent, the inner layer weights evolve similarly to power method iterations on the local Hessian, which allows the network to recover the entire hidden subspace by balancing noise suppression and signal preservation over a diverging number of steps.
The authors establish that achieving optimal sample and time complexity requires training the first layer for a diverging (but sublogarithmic) number of steps, rather than a constant number, to properly balance noise elimination and full subspace recovery.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Near-optimal learning of multi-index models via layer-wise gradient descent
The authors prove that a two-layer neural network trained with layer-wise gradient descent achieves information-theoretically optimal sample complexity (up to leading order) and time complexity for learning generic Gaussian multi-index models, matching the Θ(d) sample threshold and Θ(d²) time threshold.
[1] Survey on Algorithms for multi-index models
[2] On Learning Gaussian Multi-index Models with Gradient Flow
[7] Low-dimensional functions are efficiently learnable under randomly biased distributions
[10] Repetita iuvant: Data repetition allows SGD to learn high-dimensional multi-index functions
[47] Online Learning of Neural Networks
[48] Fundamental limits of learning in sequence multi-index models and deep attention networks: High-dimensional asymptotics and sharp thresholds
[49] Learning multi-index models with neural networks via mean-field Langevin dynamics
[50] Omnipredicting Single-Index Models with Multi-index Models
[51] Pruning is Optimal for Learning Sparse Features in High-Dimensions
[52] Fast and Sample Near-Optimal Algorithms for Learning Multidimensional Histograms
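For intuition on how the two claimed thresholds fit together, consider a back-of-the-envelope runtime count. This assumes a per-sample gradient cost linear in d and a near-constant number of passes; it is an illustrative tally, not a statement of the paper's exact algorithm or accounting:

```latex
\[
  \text{time} \;=\; \tilde{O}(n \cdot d) \;=\; \tilde{O}(d \cdot d) \;=\; \tilde{O}(d^2),
  \qquad n = \tilde{O}(d)\ \text{samples}.
\]
```

Under these assumptions, Õ(d²) time is the natural runtime of any first-order method using near-linearly many samples, so the sample and time thresholds are consistent with one another rather than two independent claims.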
Power-iteration mechanism for feature learning in neural networks
The authors show that during the first stage of gradient descent, the inner layer weights evolve similarly to power method iterations on the local Hessian, which allows the network to recover the entire hidden subspace by balancing noise suppression and signal preservation over a diverging number of steps.
[28] High-dimensional optimization for multi-spiked tensor PCA
[29] Commonality and Individuality-Based Subspace Learning
[30] Self-supervised deep multi-view subspace clustering
[31] Portable Executable Header Based Ransomware Detection using Power Iteration and Artificial Neural Network
[32] GSS: Graph-based subspace learning with shots initialization for few-shot recognition
[33] Removal of hidden units and weights for back propagation networks
[34] CP-decomposition with Tensor Power Method for Convolutional Neural Networks compression
[35] Understanding the message passing in graph neural networks via power iteration clustering
[36] Generative Restricted Kernel Machines
[37] An efficient matrix factorization based low-rank representation for subspace clustering
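The power-iteration analogy can be made concrete with a small numerical sketch. This is an illustrative toy, not the paper's algorithm: the matrix `H` below is a hypothetical stand-in for the "local Hessian" (a rank-k signal on a hidden subspace `U` plus symmetric noise), and the update is plain orthogonalized subspace iteration rather than gradient descent on a network loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 200, 3, 8  # ambient dimension, hidden-subspace dimension, iterations

# Hypothetical stand-in for the local Hessian: rank-k signal aligned with
# the hidden subspace U, plus a symmetric noise matrix of smaller norm.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))      # hidden subspace
G = rng.standard_normal((d, d)) / (2 * np.sqrt(d))
H = 2.0 * U @ U.T + (G + G.T) / 2                     # signal + noise

# Subspace power iteration: repeated multiply, then re-orthonormalize,
# mimicking how the first-layer weights amplify the signal directions
# while the bulk noise directions decay geometrically.
W, _ = np.linalg.qr(rng.standard_normal((d, k)))      # random init
for _ in range(T):
    W, _ = np.linalg.qr(H @ W)

# Alignment of the learned frame with the hidden subspace (1.0 = perfect).
alignment = np.linalg.norm(U.T @ W, ord="fro") ** 2 / k
print(round(alignment, 3))
```

Because the signal eigenvalues (≈ 2) dominate the noise bulk (operator norm well below 1 at this scaling), a handful of iterations drives the alignment close to 1 from a nearly orthogonal random start, illustrating the noise-suppression/signal-preservation balance described above.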
Necessity of training the first layer for diverging steps
The authors establish that achieving optimal sample and time complexity requires training the first layer for a diverging (but sublogarithmic) number of steps, rather than a constant number, to properly balance noise elimination and full subspace recovery.
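A rough heuristic for why a constant number of first-layer steps cannot suffice (a stylized caricature under an assumed fixed eigenvalue gap, not the paper's actual argument): a random initialization in R^d overlaps the hidden subspace by only about d^{-1/2}, and each power-method-like step amplifies signal relative to noise by the eigenvalue ratio λ_s/λ_n, so reaching constant alignment requires

```latex
\[
  d^{-1/2}\,\Bigl(\frac{\lambda_s}{\lambda_n}\Bigr)^{T} \;\gtrsim\; 1
  \quad\Longleftrightarrow\quad
  T \;\gtrsim\; \frac{\log d}{2\,\log(\lambda_s/\lambda_n)} .
\]
```

Any constant T therefore fails for large d, consistent with the necessity claim; the paper's finer analysis, which tracks how the gap itself behaves, sharpens this crude logarithmic bound to a diverging but sublogarithmic number of steps.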