Fast Convergence of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: natural gradient descent, over-parameterization, physics-informed neural networks, neural tangent kernel
Abstract:

In the context of over-parameterization, a line of work demonstrates that randomly initialized (stochastic) gradient descent (GD) converges to a globally optimal solution at a linear rate for the quadratic loss. However, the convergence rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix, leading to slow training. In this paper, we show that for training two-layer ReLU³ Physics-Informed Neural Networks (PINNs), the learning rate can be improved from the smallest eigenvalue of the limiting Gram matrix to the reciprocal of the largest eigenvalue, implying that GD actually enjoys a faster convergence rate. Despite this improvement, the convergence rate is still tied to the least eigenvalue of the Gram matrix, leading to slow convergence. We then establish the positive definiteness of Gram matrices for general smooth activation functions and provide a convergence analysis of natural gradient descent (NGD) for training two-layer PINNs, demonstrating that the maximal learning rate can be O(1), and that at this rate the convergence rate is independent of the Gram matrix. In particular, for smooth activation functions, the convergence rate of NGD is quadratic. Numerical experiments verify our theoretical results.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes improved convergence analysis for gradient descent and natural gradient descent in over-parameterized two-layer ReLU³ PINNs, demonstrating faster learning rates independent of the Gram matrix's smallest eigenvalue. It resides in the 'Over-parameterized Regime Analysis' leaf under 'Theoretical Convergence Analysis', where it is currently the sole paper. This positioning indicates a sparse research direction within the taxonomy, suggesting the specific focus on over-parameterized PINNs with rigorous convergence guarantees represents relatively unexplored territory in the surveyed literature.

The taxonomy reveals neighboring work primarily in algorithmic variants rather than theoretical analysis. The sibling leaf 'Simplified Model Analysis' contains one paper examining quadratic approximations, while the broader 'Algorithmic Variants and Computational Efficiency' branch houses multiple papers on dual formulations, energy metrics, and preconditioning techniques. The paper's theoretical focus on learning rate bounds and Gram matrix dependencies distinguishes it from these computational approaches, though connections exist through shared interest in natural gradient methods. The taxonomy's scope and exclude notes clarify that full nonlinear PDE convergence analysis belongs in this leaf, separating it from simplified models or purely empirical studies.

Among twenty-seven candidates examined, the contribution-level statistics reveal mixed novelty signals. The improved gradient descent analysis examined ten candidates with zero refutations, suggesting this specific learning rate improvement may be novel within the search scope. The Gram matrix positive definiteness framework examined seven candidates and found one refutable match, indicating some overlap with prior theoretical work on matrix properties. The natural gradient descent convergence analysis also examined ten candidates without refutation. These statistics reflect a limited semantic search scope rather than exhaustive coverage, meaning unexamined literature could contain relevant prior work.

The analysis suggests moderate novelty given the constrained search scope. The paper's theoretical contributions appear relatively fresh within the examined candidate pool, particularly regarding learning rate improvements for standard gradient descent. However, the single refutation for Gram matrix analysis and the sparse taxonomy leaf indicate both potential overlap with existing theory and limited prior work in this specific over-parameterized PINN setting. A broader literature search beyond top-thirty semantic matches would be needed to assess novelty more definitively.

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 1

Research Landscape Overview

Core task: convergence analysis of natural gradient descent for physics-informed neural networks. The field structure reflects a multi-faceted investigation into how natural gradient methods can be rigorously understood and effectively deployed for solving partial differential equations via neural networks. The taxonomy organizes work into four main branches: Theoretical Convergence Analysis examines formal guarantees and convergence rates under various assumptions (including over-parameterized regimes); Algorithmic Variants and Computational Efficiency explores practical modifications such as sketching techniques, dual formulations, and Hessian-free implementations that reduce computational overhead; Empirical Validation and Applications demonstrates performance on benchmark problems; and Geometric and Theoretical Foundations studies the underlying mathematical structures, including Fisher information geometry and implicit-function perspectives. Representative works span rigorous theory such as Fast Convergence Rates for [8] and computational innovations such as Near-optimal Sketchy Natural Gradients [1] and Dual Natural Gradient Descent [2].

A particularly active line of work balances theoretical rigor with computational tractability. Many studies investigate how natural gradient methods achieve faster convergence than standard gradient descent by exploiting the Fisher information metric, yet face challenges in computing or approximating the Fisher matrix efficiently. Works like Gauss-Newton Natural Gradient Descent [3] and TENG [5] propose algorithmic refinements to mitigate these costs, while others such as Dual Cone Gradient Descent [4] explore alternative geometric perspectives. Fast Convergence of Natural [0] situates itself within the Theoretical Convergence Analysis branch, specifically addressing over-parameterized regimes where neural networks are sufficiently expressive. Its emphasis on proving fast convergence rates complements nearby algorithmic studies like Gauss-Newton Natural Gradient Descent [3], which also targets convergence but focuses on practical Gauss-Newton approximations, and contrasts with purely empirical demonstrations such as Achieving high accuracy with [7]. The interplay between provable guarantees and computational feasibility remains a central open question across these branches.

Claimed Contributions

Improved convergence analysis of gradient descent for over-parameterized PINNs

The authors develop a refined convergence analysis of gradient descent for training two-layer Physics-Informed Neural Networks. They relax the learning-rate requirement from O(λ₀) to O(1/λₘₐₓ) and reduce the network-width requirement from Ω((n₁+n₂)²/(λ₀⁴δ³)) to Ω((1/λ₀⁴) log((n₁+n₂)/δ)), using a new recursion formula for the gradient-descent dynamics.

10 retrieved papers
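As a hedged paraphrase of this contribution (the notation below is reconstructed in the style of standard NTK analyses, not copied from the paper: λ₀ and λ_max denote the smallest and largest eigenvalues of the limiting Gram matrix, u(k) the network outputs after k GD steps, and y the targets), the step-size improvement can be summarized as:

```latex
% Prior NTK-style analyses: step size scales with the smallest eigenvalue
\eta = \mathcal{O}(\lambda_0),
\qquad
\|u(k) - y\|_2^2 \le \Big(1 - \tfrac{\eta \lambda_0}{2}\Big)^{k} \|u(0) - y\|_2^2 .

% Claimed improvement: step size scales with the reciprocal of the
% largest eigenvalue, which is much larger when the Gram matrix is
% ill-conditioned, so the per-step contraction factor improves from
% roughly  1 - c\,\lambda_0^2  to  1 - c\,\lambda_0 / \lambda_{\max}.
\eta = \mathcal{O}\big(1/\lambda_{\max}\big).
```

Either way the contraction factor still involves λ₀, which is why the report notes the GD rate remains tied to the least eigenvalue.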
Framework for positive definiteness of Gram matrices with smooth activation functions

The authors establish a general framework proving that Gram matrices remain strictly positive definite for various smooth activation functions (logistic, softplus, hyperbolic tangent, swish, etc.) in the PINN setting. This result extends beyond the specific PDE considered and applies to other PDE forms.

7 retrieved papers
Can Refute
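A minimal numerical sketch of the kind of statement this contribution makes, under loudly labeled assumptions: the paper's Gram matrix involves PDE operators applied to the network, which this sketch omits; here we only check the plain NTK-style Gram matrix of a finite-width two-layer tanh network on distinct unit-norm inputs. All sizes (`n`, `d`, `m`) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 5000          # samples, input dimension, network width

# Distinct unit-norm inputs (pairwise non-parallel with probability 1)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# For f(x) = (1/sqrt(m)) * sum_r a_r * tanh(w_r @ x) with a_r = +-1,
# the gradient w.r.t. w_r is (1/sqrt(m)) * a_r * tanh'(w_r @ x) * x,
# so G_ij = (1/m) sum_r tanh'(w_r @ x_i) tanh'(w_r @ x_j) * (x_i @ x_j)
W = rng.standard_normal((m, d))
S = 1.0 - np.tanh(X @ W.T) ** 2          # n x m matrix of tanh'(w_r @ x_i)
G = (S @ S.T) / m * (X @ X.T)            # Hadamard product of two PSD factors

lam_min = np.linalg.eigvalsh(G).min()
print(f"smallest eigenvalue: {lam_min:.2e}")   # strictly positive
```

Strict positive definiteness here follows from the Schur product theorem (the entrywise product of a positive definite matrix with a PSD matrix of positive diagonal is positive definite); the paper's contribution is the analogous statement for PINN Gram matrices with a family of smooth activations.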
Convergence analysis of natural gradient descent for over-parameterized PINNs

The authors prove that natural gradient descent converges to a global optimum for two-layer PINNs with either ReLU³ or smooth activation functions. The learning rate can be O(1), making the convergence rate independent of the sample size and of the smallest eigenvalue of the Gram matrix. For smooth activations, NGD achieves quadratic convergence.

10 retrieved papers
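To make the "O(1) learning rate, conditioning-independent rate" claim concrete, here is a hedged toy sketch (my own construction, not the paper's setting): an ill-conditioned linear least-squares problem, where plain GD is throttled by the Gram matrix's eigenvalue spread while a Gauss-Newton/NGD-style step with step size 1 removes the residual in one iteration regardless of conditioning.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Ill-conditioned linear residual r(theta) = A @ theta - y; the Gram
# matrix A @ A.T has eigenvalues spanning twelve orders of magnitude.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -6, n)                # singular values 1 .. 1e-6
A = U @ np.diag(s) @ V.T
y = rng.standard_normal(n)

# Plain GD: step size must stay below 1/lambda_max, so components along
# small-eigenvalue directions contract extremely slowly.
theta_gd = np.zeros(n)
eta = 1.0 / s.max() ** 2
for _ in range(1000):
    theta_gd -= eta * A.T @ (A @ theta_gd - y)

# NGD / Gauss-Newton step theta <- theta - J^+ r with step size 1;
# for a linear residual a single step is exact, independent of the
# Gram matrix's eigenvalues.
theta_ngd = np.zeros(n)
residual = A @ theta_ngd - y
theta_ngd -= np.linalg.lstsq(A, residual, rcond=None)[0]

gd_err = np.linalg.norm(A @ theta_gd - y)
ngd_err = np.linalg.norm(A @ theta_ngd - y)
print(f"GD residual after 1000 steps: {gd_err:.2e}")   # still large
print(f"NGD residual after 1 step:    {ngd_err:.2e}")  # tiny
```

The paper's actual result concerns nonlinear over-parameterized PINNs, where the Jacobian changes along the trajectory; this sketch only illustrates why preconditioning by the (pseudo-inverse of the) Jacobian decouples the rate from the Gram spectrum.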

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Improved convergence analysis of gradient descent for over-parameterized PINNs

The authors develop a refined convergence analysis of gradient descent for training two-layer Physics-Informed Neural Networks. They relax the learning-rate requirement from O(λ₀) to O(1/λₘₐₓ) and reduce the network-width requirement from Ω((n₁+n₂)²/(λ₀⁴δ³)) to Ω((1/λ₀⁴) log((n₁+n₂)/δ)), using a new recursion formula for the gradient-descent dynamics.

Contribution

Framework for positive definiteness of Gram matrices with smooth activation functions

The authors establish a general framework proving that Gram matrices remain strictly positive definite for various smooth activation functions (logistic, softplus, hyperbolic tangent, swish, etc.) in the PINN setting. This result extends beyond the specific PDE considered and applies to other PDE forms.

Contribution

Convergence analysis of natural gradient descent for over-parameterized PINNs

The authors prove that natural gradient descent converges to global optima for two-layer PINNs with either ReLU³ or smooth activation functions. The learning rate can be O(1), making the convergence rate independent of sample size and the smallest eigenvalue of the Gram matrix. For smooth activations, NGD achieves quadratic convergence.