Neural Networks Learn Multi-Index Models Near the Information-Theoretic Limit

ICLR 2026 Conference Submission
Anonymous Authors
Representation Learning, Multi-Index Models, Two-Layer Network, Gradient Descent, Sample Complexity
Abstract:

In deep learning, a central question is how neural networks efficiently learn high-dimensional features. To this end, we study gradient descent learning of a general Gaussian multi-index model $f(\boldsymbol{x}) = g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U} \in \mathbb{R}^{r \times d}$, the canonical setup for studying representation learning. We prove that, under generic non-degeneracy assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. Both the sample and time complexity match the information-theoretic limit up to leading order and are therefore optimal. The proof shows that during the first stage of training, the inner weights perform a power-iteration process that implicitly mimics a spectral start for the whole span of the hidden subspace, eventually eliminating finite-sample noise and recovering that span. Surprisingly, this indicates that the optimal result can be achieved only if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to learn hierarchical functions effectively in terms of both sample and time efficiency.
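The setup above can be made concrete with a short sketch of the Gaussian multi-index model $f(\boldsymbol{x}) = g(\boldsymbol{U}\boldsymbol{x})$. The link function `g`, the dimensions, and the sample size below are hypothetical illustrative choices, not the paper's specific setting.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, n = 200, 3, 1000  # ambient dim, hidden-subspace dim, samples

# Hidden subspace U in R^{r x d} with orthonormal rows.
U = np.linalg.qr(rng.standard_normal((d, r)))[0].T

def g(z):
    # Hypothetical non-degenerate link function on the r projections.
    return np.tanh(z).sum(axis=-1) + 0.5 * (z ** 2).sum(axis=-1)

X = rng.standard_normal((n, d))  # Gaussian inputs
y = g(X @ U.T)                   # labels depend on x only through Ux
```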

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes near-optimal sample and time complexity for learning Gaussian multi-index models via layer-wise gradient descent, achieving Õ(d) samples and Õ(d²) time that match information-theoretic limits. It resides in the 'Information-Theoretic Optimality and Complexity Bounds' leaf alongside two sibling papers within the 'Theoretical Foundations and Sample Complexity' branch. This leaf represents a focused research direction with only three papers total, indicating a relatively sparse but foundational area concerned with proving optimality guarantees rather than exploring algorithmic variants or training dynamics.

The taxonomy reveals neighboring research directions that contextualize this work. The sibling 'Computational Hardness and Learnability Limits' leaf (two papers) explores fundamental barriers, while the 'Training Dynamics and Gradient Flow Analysis' branch (seven papers across four leaves) examines temporal evolution and convergence properties. The paper bridges these areas by proving optimal complexity while analyzing gradient descent dynamics, particularly through its power-iteration mechanism. Its position in Theoretical Foundations distinguishes it from algorithmic contributions in the 'Algorithm Design' branch (seven papers) and application-focused work in 'Specialized Models' (eight papers).

Among twenty-nine candidates examined via limited semantic search, none clearly refute the three main contributions. The first contribution (near-optimal learning) examined ten candidates with zero refutations; the second (power-iteration mechanism) examined ten with zero refutations; the third (necessity of diverging first-layer steps) examined nine with zero refutations. This suggests that within the examined scope, the specific combination of optimal complexity guarantees, mechanistic analysis via power iteration, and the necessity result for first-layer training appears novel. However, the limited search scale means potentially relevant prior work outside the top-K matches may exist.

Based on the examined literature, the work appears to make substantive theoretical contributions at the intersection of optimality theory and training dynamics. The analysis covers top-thirty semantic matches plus citation expansion, providing reasonable confidence within this scope but not exhaustive coverage of the broader deep learning theory literature. The sparse population of the information-theoretic optimality leaf and absence of refutations among examined candidates suggest meaningful novelty, though comprehensive assessment would require broader search.

Taxonomy

Core-task Taxonomy Papers: 27
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: gradient descent learning of multi-index models with neural networks. The field centers on understanding when and how neural networks can efficiently learn functions that depend on a small number of projections (indices) of high-dimensional inputs. The taxonomy reveals several complementary perspectives: Theoretical Foundations and Sample Complexity investigates information-theoretic limits and statistical requirements, often asking what sample sizes are necessary or sufficient for learning; Training Dynamics and Gradient Flow Analysis examines the evolution of parameters during optimization, characterizing convergence behavior and implicit biases; Algorithm Design and Methodological Approaches develops practical training schemes and computational strategies; Specialized Models and Applications tailors these ideas to particular architectures or problem domains; and Broader Perspectives synthesizes open questions across these threads.

Representative works such as Gaussian Multi-Index Gradient Flow[2] and Gaussian Multi-Index Two-Timescale[3] illustrate how gradient-based methods can provably recover multi-index structure under favorable conditions, while studies like Multi-Index Time Complexity[4] and Reusing Batches Breaking Curse[5] explore computational and sample efficiency trade-offs. A central tension emerges between what is information-theoretically possible and what gradient descent can achieve in practice. Many studies focus on Gaussian or benign data distributions where gradient flow provably learns the correct indices, yet computational hardness results such as Ridge Combinations Computational Hardness[6] and Weak Learnability Computational Limits[9] suggest that worst-case instances may resist efficient learning.

Within this landscape, Neural Multi-Index Information Limit[0] sits squarely in the information-theoretic optimality branch, establishing fundamental sample complexity bounds that any learner must respect. This contrasts with nearby works like Generic Multi-Index Information Limit[13], which may explore broader distributional settings, and Generative Leap Gaussian Multi-Index[27], which could emphasize generative or algorithmic aspects. By delineating the minimal information requirements, Neural Multi-Index Information Limit[0] provides a benchmark against which algorithmic approaches, whether gradient-based or otherwise, can be measured, highlighting the gap between statistical possibility and computational feasibility.

Claimed Contributions

Near-optimal learning of multi-index models via layer-wise gradient descent

The authors prove that a two-layer neural network trained with layer-wise gradient descent achieves information-theoretically optimal sample complexity (up to leading order) and time complexity for learning generic Gaussian multi-index models, matching the Θ(d) sample threshold and Θ(d²) time threshold.

10 retrieved papers
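As a rough illustration of layer-wise training (a minimal sketch, not the paper's exact algorithm, schedule, or guarantee): a few full-batch gradient steps on the inner weights with the outer layer frozen, followed by a ridge-regression refit of the outer layer on the learned features. The width, step count, learning rate, and target below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

d, r, m, n = 64, 2, 128, 4096   # input dim, hidden dim, width, samples
U = np.linalg.qr(rng.standard_normal((d, r)))[0].T
X = rng.standard_normal((n, d))
y = np.tanh(X @ U.T).sum(axis=1)  # hypothetical multi-index target

W = rng.standard_normal((m, d)) / np.sqrt(d)  # inner weights
a = rng.choice([-1.0, 1.0], size=m) / m       # outer layer, frozen in stage 1

def forward(W, X):
    return np.tanh(X @ W.T)  # hidden activations, shape (n, m)

# Stage 1: a few full-batch gradient steps on W only (squared loss).
lr = 0.5
for _ in range(5):
    H = forward(W, X)
    resid = H @ a - y                           # (n,)
    grad_H = np.outer(resid, a) * (1 - H ** 2)  # dL/d(pre-activation)
    W -= lr * grad_H.T @ X / n

# Stage 2: refit the outer layer by ridge regression on frozen features.
H = forward(W, X)
a = np.linalg.solve(H.T @ H + 1e-3 * np.eye(m), H.T @ y)
train_mse = np.mean((H @ a - y) ** 2)
```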
Power-iteration mechanism for feature learning in neural networks

The authors show that during the first stage of gradient descent, the inner layer weights evolve similarly to power method iterations on the local Hessian, which allows the network to recover the entire hidden subspace by balancing noise suppression and signal preservation over a diverging number of steps.

10 retrieved papers
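The power-iteration picture can be sketched on a synthetic stand-in for the local Hessian: a rank-r signal along a hidden subspace plus symmetric noise. Repeated multiply-and-orthonormalize steps amplify the signal directions and suppress the noise, so the iterate's span converges to the hidden subspace. The matrix and scalings here are hypothetical, chosen only to make the mechanism visible.

```python
import numpy as np

rng = np.random.default_rng(2)

d, r = 300, 3
V = np.linalg.qr(rng.standard_normal((d, r)))[0]  # hidden subspace basis
E = rng.standard_normal((d, d)) / np.sqrt(d)
M = V @ V.T + 0.1 * (E + E.T) / 2                 # rank-r signal + symmetric noise

def subspace_error(Q, V):
    # sin of the largest principal angle between span(Q) and span(V)
    return np.linalg.norm(Q @ Q.T @ V - V, 2)

Q = np.linalg.qr(rng.standard_normal((d, r)))[0]  # random start
errs = []
for _ in range(30):
    Q, _ = np.linalg.qr(M @ Q)  # one power step, then re-orthonormalize
    errs.append(subspace_error(Q, V))
# errs decreases toward a small floor set by the noise level
```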
Necessity of training the first layer for diverging steps

The authors establish that achieving optimal sample and time complexity requires training the first layer for a diverging (but sublogarithmic) number of steps, rather than a constant number, to properly balance noise elimination and full subspace recovery.

9 retrieved papers
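The step-count claim can be illustrated numerically in a synthetic power-iteration setting (hypothetical scalings, not the paper's construction): a single power step from a random start leaves the subspace estimate far from the target, while a modestly growing number of steps drives the error down to the noise floor.

```python
import numpy as np

def final_error(d, r, n_steps, noise_level, rng):
    # Subspace error after n_steps of orthogonal iteration on a noisy
    # rank-r signal matrix; returns sin of the largest principal angle.
    V = np.linalg.qr(rng.standard_normal((d, r)))[0]
    E = rng.standard_normal((d, d)) / np.sqrt(d)
    M = V @ V.T + noise_level * (E + E.T) / 2
    Q = np.linalg.qr(rng.standard_normal((d, r)))[0]
    for _ in range(n_steps):
        Q, _ = np.linalg.qr(M @ Q)
    return np.linalg.norm(Q @ Q.T @ V - V, 2)

# Same random draws in both runs; only the step count differs.
one_step = final_error(400, 3, 1, 0.1, np.random.default_rng(3))
many_steps = final_error(400, 3, 25, 0.1, np.random.default_rng(3))
# with these scalings, many_steps is expected to fall well below one_step
```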

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Near-optimal learning of multi-index models via layer-wise gradient descent


Contribution

Power-iteration mechanism for feature learning in neural networks


Contribution

Necessity of training the first layer for diverging steps
