Fast Convergence of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: natural gradient descent, over-parameterization, physics-informed neural networks, neural tangent kernel
Abstract:

In the context of over-parameterization, a line of work demonstrates that randomly initialized (stochastic) gradient descent (GD) converges to a globally optimal solution at a linear rate for the quadratic loss. However, the convergence rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix, leading to slow training. In this paper, we show that for training two-layer ReLU³ Physics-Informed Neural Networks (PINNs), the learning rate can be improved from the smallest eigenvalue of the limiting Gram matrix to the reciprocal of the largest eigenvalue, implying that GD actually enjoys a faster convergence rate. Despite this improvement, the convergence rate is still tied to the least eigenvalue of the Gram matrix, leading to slow convergence. We then establish the positive definiteness of Gram matrices for general smooth activation functions and provide a convergence analysis of natural gradient descent (NGD) for training two-layer PINNs, demonstrating that the maximal learning rate can be O(1), and that at this rate the convergence rate is independent of the Gram matrix. In particular, for smooth activation functions, the convergence rate of NGD is quadratic. Numerical experiments verify our theoretical results.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes improved convergence analysis for gradient descent and natural gradient descent in over-parameterized two-layer ReLU³ PINNs, demonstrating faster learning rates independent of the Gram matrix's smallest eigenvalue. It resides in the 'Over-parameterized Regime Analysis' leaf under 'Theoretical Convergence Analysis', where it is currently the sole paper. This positioning indicates a sparse research direction within the taxonomy, suggesting the specific focus on over-parameterized PINNs with rigorous convergence guarantees represents relatively unexplored territory in the surveyed literature.

The taxonomy reveals neighboring work primarily in algorithmic variants rather than theoretical analysis. The sibling leaf 'Simplified Model Analysis' contains one paper examining quadratic approximations, while the broader 'Algorithmic Variants and Computational Efficiency' branch houses multiple papers on dual formulations, energy metrics, and preconditioning techniques. The paper's theoretical focus on learning rate bounds and Gram matrix dependencies distinguishes it from these computational approaches, though connections exist through shared interest in natural gradient methods. The taxonomy's scope and exclude notes clarify that full nonlinear PDE convergence analysis belongs in this leaf, separating it from simplified models or purely empirical studies.

Among twenty-seven candidates examined, the contribution-level statistics reveal mixed novelty signals. The improved gradient descent analysis examined ten candidates with zero refutations, suggesting this specific learning rate improvement may be novel within the search scope. The Gram matrix positive definiteness framework examined seven candidates and found one refutable match, indicating some overlap with prior theoretical work on matrix properties. The natural gradient descent convergence analysis also examined ten candidates without refutation. These statistics reflect a limited semantic search scope rather than exhaustive coverage, meaning unexamined literature could contain relevant prior work.

The analysis suggests moderate novelty given the constrained search scope. The paper's theoretical contributions appear relatively fresh within the examined candidate pool, particularly regarding learning rate improvements for standard gradient descent. However, the single refutation for Gram matrix analysis and the sparse taxonomy leaf indicate both potential overlap with existing theory and limited prior work in this specific over-parameterized PINN setting. A broader literature search beyond top-thirty semantic matches would be needed to assess novelty more definitively.

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 1

Research Landscape Overview

Core task: convergence analysis of natural gradient descent for physics-informed neural networks. The field structure reflects a multi-faceted investigation into how natural gradient methods can be rigorously understood and effectively deployed for solving partial differential equations via neural networks. The taxonomy organizes work into four main branches: Theoretical Convergence Analysis examines formal guarantees and convergence rates under various assumptions (including over-parameterized regimes); Algorithmic Variants and Computational Efficiency explores practical modifications such as sketching techniques, dual formulations, and Hessian-free implementations that reduce computational overhead; Empirical Validation and Applications demonstrates performance on benchmark problems; and Geometric and Theoretical Foundations studies the underlying mathematical structures, including Fisher information geometry and implicit-function perspectives. Representative works span rigorous theory such as Fast Convergence Rates for [8] and computational innovations such as Near-optimal Sketchy Natural Gradients [1] and Dual Natural Gradient Descent [2].

A particularly active line of work balances theoretical rigor with computational tractability. Many studies investigate how natural gradient methods achieve faster convergence than standard gradient descent by exploiting the Fisher information metric, yet face challenges in computing or approximating the Fisher matrix efficiently. Works like Gauss-Newton Natural Gradient Descent [3] and TENG [5] propose algorithmic refinements to mitigate these costs, while others such as Dual Cone Gradient Descent [4] explore alternative geometric perspectives. Fast Convergence of Natural [0] situates itself within the Theoretical Convergence Analysis branch, specifically addressing over-parameterized regimes where neural networks are sufficiently expressive. Its emphasis on proving fast convergence rates complements nearby algorithmic studies like Gauss-Newton Natural Gradient Descent [3], which also targets convergence but focuses on practical Gauss-Newton approximations, and contrasts with purely empirical demonstrations such as Achieving high accuracy with [7]. The interplay between provable guarantees and computational feasibility remains a central open question across these branches.

Claimed Contributions

Improved convergence analysis of gradient descent for over-parameterized PINNs

The authors develop a refined convergence analysis of gradient descent for training two-layer Physics-Informed Neural Networks. They relax the learning-rate requirement from O(λ₀) to O(1/λₘₐₓ) and reduce the network-width requirement from Ω((n₁+n₂)²/(λ₀⁴δ³)) to Ω((1/λ₀⁴) log((n₁+n₂)/δ)), using a new recursion formula for the gradient-descent dynamics.

10 retrieved papers
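As a hedged paraphrase of this contribution (the notation below is reconstructed in the style of standard NTK analyses, not copied from the paper: λ₀ and λ_max denote the smallest and largest eigenvalues of the limiting Gram matrix, u(k) the network outputs after k GD steps, and y the targets), the step-size improvement can be summarized as:

```latex
% Prior NTK-style analyses: step size scales with the smallest eigenvalue
\eta = \mathcal{O}(\lambda_0),
\qquad
\|u(k) - y\|_2^2 \le \Big(1 - \tfrac{\eta \lambda_0}{2}\Big)^{k} \|u(0) - y\|_2^2 .

% Claimed improvement: step size scales with the reciprocal of the
% largest eigenvalue, which is much larger when the Gram matrix is
% ill-conditioned, so the per-step contraction factor improves from
% roughly  1 - c\,\lambda_0^2  to  1 - c\,\lambda_0 / \lambda_{\max}.
\eta = \mathcal{O}\big(1/\lambda_{\max}\big).
```

Either way the contraction factor still involves λ₀, which is why the report notes the GD rate remains tied to the least eigenvalue.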
Framework for positive definiteness of Gram matrices with smooth activation functions

The authors establish a general framework proving that Gram matrices remain strictly positive definite for various smooth activation functions (logistic, softplus, hyperbolic tangent, swish, etc.) in the PINN setting. This result extends beyond the specific PDE considered and applies to other PDE forms.

7 retrieved papers
Can Refute
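A minimal numerical sketch of the kind of statement this contribution makes, under loudly labeled assumptions: the paper's Gram matrix involves PDE operators applied to the network, which this sketch omits; here we only check the plain NTK-style Gram matrix of a finite-width two-layer tanh network on distinct unit-norm inputs. All sizes (`n`, `d`, `m`) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 5000          # samples, input dimension, network width

# Distinct unit-norm inputs (pairwise non-parallel with probability 1)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# For f(x) = (1/sqrt(m)) * sum_r a_r * tanh(w_r @ x) with a_r = +-1,
# the gradient w.r.t. w_r is (1/sqrt(m)) * a_r * tanh'(w_r @ x) * x,
# so G_ij = (1/m) sum_r tanh'(w_r @ x_i) tanh'(w_r @ x_j) * (x_i @ x_j)
W = rng.standard_normal((m, d))
S = 1.0 - np.tanh(X @ W.T) ** 2          # n x m matrix of tanh'(w_r @ x_i)
G = (S @ S.T) / m * (X @ X.T)            # Hadamard product of two PSD factors

lam_min = np.linalg.eigvalsh(G).min()
print(f"smallest eigenvalue: {lam_min:.2e}")   # strictly positive
```

Strict positive definiteness here follows from the Schur product theorem (the entrywise product of a positive definite matrix with a PSD matrix of positive diagonal is positive definite); the paper's contribution is the analogous statement for PINN Gram matrices with a family of smooth activations.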
Convergence analysis of natural gradient descent for over-parameterized PINNs

The authors prove that natural gradient descent converges to a global optimum for two-layer PINNs with either ReLU³ or smooth activation functions. The learning rate can be O(1), making the convergence rate independent of the sample size and of the smallest eigenvalue of the Gram matrix. For smooth activations, NGD achieves quadratic convergence.

10 retrieved papers
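To make the "O(1) learning rate, conditioning-independent rate" claim concrete, here is a hedged toy sketch (my own construction, not the paper's setting): an ill-conditioned linear least-squares problem, where plain GD is throttled by the Gram matrix's eigenvalue spread while a Gauss-Newton/NGD-style step with step size 1 removes the residual in one iteration regardless of conditioning.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Ill-conditioned linear residual r(theta) = A @ theta - y; the Gram
# matrix A @ A.T has eigenvalues spanning twelve orders of magnitude.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -6, n)                # singular values 1 .. 1e-6
A = U @ np.diag(s) @ V.T
y = rng.standard_normal(n)

# Plain GD: step size must stay below 1/lambda_max, so components along
# small-eigenvalue directions contract extremely slowly.
theta_gd = np.zeros(n)
eta = 1.0 / s.max() ** 2
for _ in range(1000):
    theta_gd -= eta * A.T @ (A @ theta_gd - y)

# NGD / Gauss-Newton step theta <- theta - J^+ r with step size 1;
# for a linear residual a single step is exact, independent of the
# Gram matrix's eigenvalues.
theta_ngd = np.zeros(n)
residual = A @ theta_ngd - y
theta_ngd -= np.linalg.lstsq(A, residual, rcond=None)[0]

gd_err = np.linalg.norm(A @ theta_gd - y)
ngd_err = np.linalg.norm(A @ theta_ngd - y)
print(f"GD residual after 1000 steps: {gd_err:.2e}")   # still large
print(f"NGD residual after 1 step:    {ngd_err:.2e}")  # tiny
```

The paper's actual result concerns nonlinear over-parameterized PINNs, where the Jacobian changes along the trajectory; this sketch only illustrates why preconditioning by the (pseudo-inverse of the) Jacobian decouples the rate from the Gram spectrum.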

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Improved convergence analysis of gradient descent for over-parameterized PINNs

The authors develop a refined convergence analysis of gradient descent for training two-layer Physics-Informed Neural Networks. They relax the learning-rate requirement from O(λ₀) to O(1/λₘₐₓ) and reduce the network-width requirement from Ω((n₁+n₂)²/(λ₀⁴δ³)) to Ω((1/λ₀⁴) log((n₁+n₂)/δ)), using a new recursion formula for the gradient-descent dynamics.

Contribution

Framework for positive definiteness of Gram matrices with smooth activation functions

The authors establish a general framework proving that Gram matrices remain strictly positive definite for various smooth activation functions (logistic, softplus, hyperbolic tangent, swish, etc.) in the PINN setting. This result extends beyond the specific PDE considered and applies to other PDE forms.

Contribution

Convergence analysis of natural gradient descent for over-parameterized PINNs

The authors prove that natural gradient descent converges to global optima for two-layer PINNs with either ReLU³ or smooth activation functions. The learning rate can be O(1), making the convergence rate independent of sample size and the smallest eigenvalue of the Gram matrix. For smooth activations, NGD achieves quadratic convergence.