Training Overparametrized Neural Networks in Sublinear Time

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: neural network training, theory, overparametrized neural networks
Abstract:

The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of artificial intelligence (AI). Despite the popularity and low cost-per-iteration of traditional backpropagation via gradient descent, stochastic gradient descent (SGD) has a prohibitive convergence rate in non-convex settings, both in theory and practice. To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with a much faster convergence rate, albeit with a higher cost-per-iteration. For a neural network with m = poly(n) parameters and an input batch of n datapoints in ℝ^d, the previous work of [Brand, Peng, Song and Weinstein, ITCS 2021] requires ~mnd + n³ time per iteration. In this paper, we present a novel training method that requires only m^(1-α)nd + n³ amortized time in the same overparametrized regime, where α ∈ (0.01, 1) is some fixed constant. This method relies on a new and alternative view of neural networks as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view would have further applications in the design and analysis of deep neural networks (DNNs). Finally, we give a lower bound for the dynamic sensitive weight searching data structure we make use of, showing that under SETH or OVC from fine-grained complexity, one cannot substantially improve our algorithm.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a novel training method achieving m^(1-α)nd + n³ amortized time per iteration for overparametrized neural networks, improving upon the mnd + n³ complexity of prior work. It resides in the 'Near-Linear and Sublinear Second-Order Methods' leaf, which contains only one sibling paper. This leaf sits within the broader 'Sublinear-Time Training Algorithms' branch, which encompasses three distinct approaches: second-order methods, preprocessing-based acceleration, and sketching/sampling techniques. The sparse population of this specific leaf suggests the paper targets a relatively focused research direction within the larger field of training acceleration.

The taxonomy reveals neighboring research directions that contextualize this work. Sibling leaves include 'Preprocessing-Based Acceleration' and 'Sketching and Sampling for Scalable Training', which pursue sublinear complexity through different mechanisms—preprocessing network structure versus randomized dimensionality reduction. The broader 'Convergence Analysis' branch addresses theoretical guarantees for first-order methods and reinforcement learning settings, while 'Efficient Training Infrastructure' focuses on distributed systems and network pruning. The paper's binary search tree representation appears to bridge algorithmic innovation with data structure design, potentially connecting to complexity analysis work in the infrastructure branch.

Across the 21 candidates examined, the sublinear-time training algorithm contribution shows substantial overlap with prior work: 3 of its 10 examined candidates appear refutable. For the binary search tree representation contribution, 10 candidates were examined and none clearly refutes it, suggesting greater novelty in this conceptual framing. For the lower bound contribution, only 1 candidate was examined, and the contribution appears refutable on that basis, though the limited search scope makes a definitive assessment difficult. These statistics indicate that while the core algorithmic speedup builds on established second-order method research, the data structure perspective may offer a more distinctive angle within the examined literature.

Based on the top-21 semantic matches and citation expansion, the work appears to advance an active but not densely populated research direction. The taxonomy structure shows the field has diversified into multiple acceleration strategies, and this paper's position in a sparsely populated leaf suggests it pursues a specific technical approach rather than entering a crowded subfield. The limited search scope means potentially relevant work in adjacent areas—such as preprocessing methods or sketching techniques—may not have been fully captured in the contribution-level analysis.

Taxonomy

Core-task Taxonomy Papers: 14
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 4

Research Landscape Overview

Core task: Accelerating training of overparametrized neural networks using sublinear time methods. The field addresses the computational bottleneck of training modern neural networks, which often contain far more parameters than training samples. The taxonomy organizes research into three main branches: Sublinear-Time Training Algorithms, which develop methods that reduce per-iteration complexity below linear time in network size; Convergence Analysis of Overparametrized Networks, which provides theoretical guarantees for optimization in the overparametrized regime; and Efficient Training Infrastructure, which focuses on practical systems and hardware considerations.

Within the algorithmic branch, researchers explore diverse approaches including sketching techniques like Sketch-GNN[2], preprocessing strategies such as Preprocessing Overparametrized[3], and second-order methods that exploit problem structure. The convergence analysis branch examines conditions under which gradient-based methods succeed, including studies of the optimization landscape like Dynamical View Optimization[4] and acceleration techniques such as Nesterov Acceleration[5], while also addressing specific settings like Locally Polyak-Lojasiewicz[6] conditions and complexity bounds for large language models in Gradient Complexity LLMs[7].

A particularly active line of work centers on near-linear and sublinear second-order methods, which aim to harness curvature information without the cubic cost of traditional Newton-type approaches. Near-Linear Training[1] exemplifies efforts to achieve near-linear time complexity per iteration while maintaining strong convergence properties. Sublinear Training[0] pushes this frontier further by targeting truly sublinear complexity, positioning itself as an advancement over Near-Linear Training[1] in terms of per-iteration cost.
This cluster contrasts with first-order methods that rely solely on gradient information, trading off the simplicity of gradient descent for potentially faster convergence through more sophisticated linear algebra. Meanwhile, works like Entropic Sparsification[12] and LSH Softmax[14] explore complementary sparsification and hashing strategies that reduce computational burden in specific network components. The central tension across these branches involves balancing theoretical guarantees, practical wall-clock speedups, and the overhead introduced by advanced algorithmic machinery.

Claimed Contributions

Sublinear-time training algorithm for overparametrized neural networks

The authors propose a training algorithm for two-layer ReLU neural networks with m = poly(n) parameters that achieves sublinear cost-per-iteration of m^(1-α)nd + n^3, improving over the previous linear-time O(mnd + n^3) per iteration while maintaining the same fast convergence rate as second-order methods.

10 retrieved papers
Can Refute
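As a sanity check on the claimed asymptotics, the two per-iteration cost formulas can be compared numerically. The concrete values of n, d, m, and α below are hypothetical, chosen only to illustrate the gap between the two bounds in the overparametrized regime m = poly(n); they are not taken from the paper.

```python
# Illustrative comparison of per-iteration costs (hypothetical parameters).
# Prior work [Brand, Peng, Song and Weinstein, ITCS 2021]: ~ m*n*d + n^3
# This paper's claim:                                       ~ m^(1-alpha)*n*d + n^3

n = 1_000          # batch size (hypothetical)
d = 100            # input dimension (hypothetical)
m = n ** 4         # network width, m = poly(n) (hypothetical choice of polynomial)
alpha = 0.1        # some fixed constant in (0.01, 1) (hypothetical)

prior_cost = m * n * d + n ** 3
new_cost = m ** (1 - alpha) * n * d + n ** 3

print(f"prior per-iteration cost: {prior_cost:.2e}")
print(f"new per-iteration cost:   {new_cost:.2e}")
print(f"speedup factor:           {prior_cost / new_cost:.1f}x")
```

Since the n³ term is shared, the improvement comes entirely from shaving the m-dependent term from m to m^(1-α), so the gap widens as the network becomes more overparametrized.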
Binary search tree representation of neural networks

The authors introduce a novel representation of neural networks as collections of binary search trees, where the relationship between network weights and training data is encoded using trees. This view enables efficient training by exploiting sparsity of activations to update only a small subset of tree nodes per iteration.

10 retrieved papers
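The report does not spell out the paper's tree data structure itself. As a rough illustration of the stated idea (each update touches only a small subset of tree nodes, and activation sparsity keeps queries cheap), the sketch below uses a max segment tree over per-neuron pre-activation scores: a point update costs O(log m), and enumerating the k active neurons (score above the ReLU threshold) costs O(k log m) by pruning dead subtrees. The class and method names are hypothetical and not taken from the paper.

```python
class MaxSegmentTree:
    """Sketch of a tree view over m neurons: leaf i holds neuron i's
    pre-activation score, internal nodes hold subtree maxima."""

    def __init__(self, m):
        # m leaves stored at indices m..2m-1 of a flat array (iterative layout).
        self.m = m
        self.tree = [float("-inf")] * (2 * m)

    def update(self, i, score):
        # Point update: only the O(log m) ancestors of leaf i are touched.
        i += self.m
        self.tree[i] = score
        while i > 1:
            i //= 2
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])

    def active(self, threshold=0.0):
        # Enumerate neurons with score > threshold, skipping any subtree
        # whose maximum is already below the threshold (sparsity pays off).
        out, stack = [], [1]
        while stack:
            node = stack.pop()
            if self.tree[node] <= threshold:
                continue  # no active neuron anywhere in this subtree
            if node >= self.m:
                out.append(node - self.m)  # leaf: an active neuron index
            else:
                stack.extend((2 * node, 2 * node + 1))
        return sorted(out)


t = MaxSegmentTree(8)   # m = 8 neurons (power of two for simplicity)
t.update(3, 0.5)
t.update(6, 1.2)
print(t.active())       # -> [3, 6]
```

If only a few neurons change sign per iteration, as the contribution claims, this kind of structure avoids rescanning all m neurons each step.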
Lower bound for dynamic sensitive weight searching data structure

The authors establish a conditional lower bound showing that under the Strong Exponential Time Hypothesis or Orthogonal Vector Conjecture, their dynamic sensitive weight searching data structure cannot be substantially improved, demonstrating near-optimality of their approach.

1 retrieved paper
Can Refute
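For context on the conjectures invoked: OVC concerns the Orthogonal Vectors problem, which asks whether two sets of n Boolean vectors in {0,1}^d contain a pair with inner product zero, and conjectures that no O(n^(2-ε))-time algorithm exists for d = ω(log n). A brute-force O(n²·d) baseline is easy to state; whether the paper's reduction uses exactly this formulation is not specified in this report.

```python
def has_orthogonal_pair(A, B):
    """Brute-force Orthogonal Vectors: check all n^2 pairs in O(n^2 * d) time.
    OVC conjectures this quadratic dependence on n cannot be substantially
    beaten when d = omega(log n), which is the kind of hardness assumption
    behind conditional lower bounds like the paper's."""
    return any(
        all(a_i * b_i == 0 for a_i, b_i in zip(a, b))
        for a in A for b in B
    )


# (1,0,1) and (0,1,0) are orthogonal, so a pair exists here:
print(has_orthogonal_pair([(1, 0, 1), (0, 1, 1)], [(0, 1, 0), (1, 1, 1)]))  # True
```

A lower bound conditioned on SETH or OVC says: any substantially faster data structure for the paper's weight-searching task would yield a subquadratic algorithm for a problem like the one above, contradicting the conjecture.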

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Sublinear-time training algorithm for overparametrized neural networks
Contribution: Binary search tree representation of neural networks
Contribution: Lower bound for dynamic sensitive weight searching data structure
