Neural Networks Learn Multi-Index Models Near the Information-Theoretic Limit

ICLR 2026 Conference Submission
Anonymous Authors
Representation Learning, Multi-Index Models, Two-Layer Network, Gradient Descent, Sample Complexity
Abstract:

In deep learning, a central question is how neural networks efficiently learn high-dimensional features. To this end, we study gradient descent learning of a general Gaussian multi-index model $f(\boldsymbol{x}) = g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U} \in \mathbb{R}^{r \times d}$, the canonical setup for studying representation learning. We prove that, under generic non-degeneracy assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. Both the sample and time complexity match the information-theoretic limit up to leading order and are therefore optimal. The proof shows that during the first stage of training, the inner weights perform a power-iteration process that implicitly mimics a spectral start for the whole span of the hidden subspace, eventually eliminating finite-sample noise and recovering that span. Surprisingly, this indicates that the optimal result can be achieved only if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to learn hierarchical functions effectively in terms of both sample and time efficiency.
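The setup above can be made concrete with a short sketch of the Gaussian multi-index model $f(\boldsymbol{x}) = g(\boldsymbol{U}\boldsymbol{x})$. The link function `g`, the dimensions, and the sample size below are hypothetical illustrative choices, not the paper's specific setting.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, n = 200, 3, 1000  # ambient dim, hidden-subspace dim, samples

# Hidden subspace U in R^{r x d} with orthonormal rows.
U = np.linalg.qr(rng.standard_normal((d, r)))[0].T

def g(z):
    # Hypothetical non-degenerate link function on the r projections.
    return np.tanh(z).sum(axis=-1) + 0.5 * (z ** 2).sum(axis=-1)

X = rng.standard_normal((n, d))  # Gaussian inputs
y = g(X @ U.T)                   # labels depend on x only through Ux
```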

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes near-optimal sample and time complexity for learning Gaussian multi-index models via layer-wise gradient descent, achieving Õ(d) samples and Õ(d²) time that match information-theoretic limits. It resides in the 'Information-Theoretic Optimality and Complexity Bounds' leaf alongside two sibling papers within the 'Theoretical Foundations and Sample Complexity' branch. This leaf represents a focused research direction with only three papers total, indicating a relatively sparse but foundational area concerned with proving optimality guarantees rather than exploring algorithmic variants or training dynamics.

The taxonomy reveals neighboring research directions that contextualize this work. The sibling 'Computational Hardness and Learnability Limits' leaf (two papers) explores fundamental barriers, while the 'Training Dynamics and Gradient Flow Analysis' branch (seven papers across four leaves) examines temporal evolution and convergence properties. The paper bridges these areas by proving optimal complexity while analyzing gradient descent dynamics, particularly through its power-iteration mechanism. Its position in Theoretical Foundations distinguishes it from algorithmic contributions in the 'Algorithm Design' branch (seven papers) and application-focused work in 'Specialized Models' (eight papers).

Among twenty-nine candidates examined via limited semantic search, none clearly refute the three main contributions. The first contribution (near-optimal learning) examined ten candidates with zero refutations; the second (power-iteration mechanism) examined ten with zero refutations; the third (necessity of diverging first-layer steps) examined nine with zero refutations. This suggests that within the examined scope, the specific combination of optimal complexity guarantees, mechanistic analysis via power iteration, and the necessity result for first-layer training appears novel. However, the limited search scale means potentially relevant prior work outside the top-K matches may exist.

Based on the examined literature, the work appears to make substantive theoretical contributions at the intersection of optimality theory and training dynamics. The analysis covers top-thirty semantic matches plus citation expansion, providing reasonable confidence within this scope but not exhaustive coverage of the broader deep learning theory literature. The sparse population of the information-theoretic optimality leaf and absence of refutations among examined candidates suggest meaningful novelty, though comprehensive assessment would require broader search.

Taxonomy

Core-task Taxonomy Papers: 27
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: gradient descent learning of multi-index models with neural networks. The field centers on understanding when and how neural networks can efficiently learn functions that depend on a small number of projections (indices) of high-dimensional inputs. The taxonomy reveals several complementary perspectives: Theoretical Foundations and Sample Complexity investigates information-theoretic limits and statistical requirements, often asking what sample sizes are necessary or sufficient for learning; Training Dynamics and Gradient Flow Analysis examines the evolution of parameters during optimization, characterizing convergence behavior and implicit biases; Algorithm Design and Methodological Approaches develops practical training schemes and computational strategies; Specialized Models and Applications tailors these ideas to particular architectures or problem domains; and Broader Perspectives synthesizes open questions across these threads.

Representative works such as Gaussian Multi-Index Gradient Flow[2] and Gaussian Multi-Index Two-Timescale[3] illustrate how gradient-based methods can provably recover multi-index structure under favorable conditions, while studies like Multi-Index Time Complexity[4] and Reusing Batches Breaking Curse[5] explore computational and sample efficiency trade-offs. A central tension emerges between what is information-theoretically possible and what gradient descent can achieve in practice. Many studies focus on Gaussian or benign data distributions where gradient flow provably learns the correct indices, yet computational hardness results such as Ridge Combinations Computational Hardness[6] and Weak Learnability Computational Limits[9] suggest that worst-case instances may resist efficient learning.

Within this landscape, Neural Multi-Index Information Limit[0] sits squarely in the information-theoretic optimality branch, establishing fundamental sample complexity bounds that any learner must respect. This contrasts with nearby works like Generic Multi-Index Information Limit[13], which may explore broader distributional settings, and Generative Leap Gaussian Multi-Index[27], which could emphasize generative or algorithmic aspects. By delineating the minimal information requirements, Neural Multi-Index Information Limit[0] provides a benchmark against which algorithmic approaches, whether gradient-based or otherwise, can be measured, highlighting the gap between statistical possibility and computational feasibility.

Claimed Contributions

Near-optimal learning of multi-index models via layer-wise gradient descent

The authors prove that a two-layer neural network trained with layer-wise gradient descent achieves information-theoretically optimal sample complexity (up to leading order) and time complexity for learning generic Gaussian multi-index models, matching the Θ(d) sample threshold and Θ(d²) time threshold.

10 retrieved papers
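As a rough illustration of layer-wise training (a minimal sketch, not the paper's exact algorithm, schedule, or guarantee): a few full-batch gradient steps on the inner weights with the outer layer frozen, followed by a ridge-regression refit of the outer layer on the learned features. The width, step count, learning rate, and target below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

d, r, m, n = 64, 2, 128, 4096   # input dim, hidden dim, width, samples
U = np.linalg.qr(rng.standard_normal((d, r)))[0].T
X = rng.standard_normal((n, d))
y = np.tanh(X @ U.T).sum(axis=1)  # hypothetical multi-index target

W = rng.standard_normal((m, d)) / np.sqrt(d)  # inner weights
a = rng.choice([-1.0, 1.0], size=m) / m       # outer layer, frozen in stage 1

def forward(W, X):
    return np.tanh(X @ W.T)  # hidden activations, shape (n, m)

# Stage 1: a few full-batch gradient steps on W only (squared loss).
lr = 0.5
for _ in range(5):
    H = forward(W, X)
    resid = H @ a - y                           # (n,)
    grad_H = np.outer(resid, a) * (1 - H ** 2)  # dL/d(pre-activation)
    W -= lr * grad_H.T @ X / n

# Stage 2: refit the outer layer by ridge regression on frozen features.
H = forward(W, X)
a = np.linalg.solve(H.T @ H + 1e-3 * np.eye(m), H.T @ y)
train_mse = np.mean((H @ a - y) ** 2)
```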
Power-iteration mechanism for feature learning in neural networks

The authors show that during the first stage of gradient descent, the inner layer weights evolve similarly to power method iterations on the local Hessian, which allows the network to recover the entire hidden subspace by balancing noise suppression and signal preservation over a diverging number of steps.

10 retrieved papers
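The power-iteration picture can be sketched on a synthetic stand-in for the local Hessian: a rank-r signal along a hidden subspace plus symmetric noise. Repeated multiply-and-orthonormalize steps amplify the signal directions and suppress the noise, so the iterate's span converges to the hidden subspace. The matrix and scalings here are hypothetical, chosen only to make the mechanism visible.

```python
import numpy as np

rng = np.random.default_rng(2)

d, r = 300, 3
V = np.linalg.qr(rng.standard_normal((d, r)))[0]  # hidden subspace basis
E = rng.standard_normal((d, d)) / np.sqrt(d)
M = V @ V.T + 0.1 * (E + E.T) / 2                 # rank-r signal + symmetric noise

def subspace_error(Q, V):
    # sin of the largest principal angle between span(Q) and span(V)
    return np.linalg.norm(Q @ Q.T @ V - V, 2)

Q = np.linalg.qr(rng.standard_normal((d, r)))[0]  # random start
errs = []
for _ in range(30):
    Q, _ = np.linalg.qr(M @ Q)  # one power step, then re-orthonormalize
    errs.append(subspace_error(Q, V))
# errs decreases toward a small floor set by the noise level
```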
Necessity of training the first layer for diverging steps

The authors establish that achieving optimal sample and time complexity requires training the first layer for a diverging (but sublogarithmic) number of steps, rather than a constant number, to properly balance noise elimination and full subspace recovery.

9 retrieved papers
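The step-count claim can be illustrated numerically in a synthetic power-iteration setting (hypothetical scalings, not the paper's construction): a single power step from a random start leaves the subspace estimate far from the target, while a modestly growing number of steps drives the error down to the noise floor.

```python
import numpy as np

def final_error(d, r, n_steps, noise_level, rng):
    # Subspace error after n_steps of orthogonal iteration on a noisy
    # rank-r signal matrix; returns sin of the largest principal angle.
    V = np.linalg.qr(rng.standard_normal((d, r)))[0]
    E = rng.standard_normal((d, d)) / np.sqrt(d)
    M = V @ V.T + noise_level * (E + E.T) / 2
    Q = np.linalg.qr(rng.standard_normal((d, r)))[0]
    for _ in range(n_steps):
        Q, _ = np.linalg.qr(M @ Q)
    return np.linalg.norm(Q @ Q.T @ V - V, 2)

# Same random draws in both runs; only the step count differs.
one_step = final_error(400, 3, 1, 0.1, np.random.default_rng(3))
many_steps = final_error(400, 3, 25, 0.1, np.random.default_rng(3))
# with these scalings, many_steps is expected to fall well below one_step
```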

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Near-optimal learning of multi-index models via layer-wise gradient descent


Contribution

Power-iteration mechanism for feature learning in neural networks


Contribution

Necessity of training the first layer for diverging steps
