Variational Deep Learning via Implicit Regularization
Overview
Overall Novelty Assessment
The paper proposes Implicit Bias Variational Inference (IBVI), which regularizes variational neural networks by relying solely on the implicit bias of stochastic gradient descent (SGD) rather than on explicit priors or hyperparameter tuning. It resides in the 'Implicit Regularization as Variational Inference' leaf, which contains only three papers total (including this one). This leaf sits within the broader 'Theoretical Foundations of Implicit Regularization' branch, indicating that the work occupies a relatively sparse research direction focused on establishing formal connections between optimization dynamics and Bayesian inference frameworks.
The taxonomy reveals neighboring leaves addressing related but distinct perspectives: 'Implicit Regularization in Wide and Overparametrized Networks' examines learning dynamics without explicit variational framing, while 'General Theoretical Perspectives on Bayesian Deep Learning' provides broader reviews. The sibling papers in the same leaf share the goal of interpreting gradient descent as variational inference, but the taxonomy's scope notes clarify that this leaf excludes purely empirical applications (which belong in 'Applied Methods') and meta-learning extensions. The paper thus connects to theoretical characterizations of implicit bias while diverging from explicit sampling methods found in the 'Variational Inference and Gradient-Based Sampling Methods' branch.
Among 25 candidates examined across three contributions, the IBVI method shows one refutable candidate out of 10 examined, suggesting some prior work addresses similar algorithmic ideas. The theoretical characterization of implicit bias as generalized variational inference examined 5 candidates with none refutable, indicating this formalization may offer fresh perspective. The extension of maximal update parametrization to probabilistic networks examined 10 candidates with none refutable, suggesting this parametrization choice is relatively unexplored in the Bayesian setting. The limited search scope (25 candidates, not exhaustive) means these assessments reflect top semantic matches rather than comprehensive field coverage.
Based on the top-25 semantic matches examined, the work appears to contribute novel theoretical framing and parametrization insights within a sparse research direction, though the IBVI method itself encounters some prior overlap. The taxonomy structure confirms this area remains less crowded than applied uncertainty quantification branches, which contain more papers. The analysis does not cover broader optimization literature or recent preprints outside the search scope, so the novelty assessment remains provisional pending deeper investigation of gradient-based Bayesian methods.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a method for training variational neural networks by maximizing the expected log-likelihood without explicit KL regularization to the prior. Instead, the method exploits the implicit regularization of SGD to prevent uncertainty collapse and achieve robust generalization.
The authors prove that for overparametrized linear models, the implicit bias of SGD when training via the expected loss is equivalent to generalized variational inference with a 2-Wasserstein regularizer penalizing deviations from the prior, extending prior results for non-probabilistic models.
The authors extend the maximal update parametrization (μP) to variational neural networks, enabling hyperparameter transfer from small to large models and ensuring feature learning even as network width increases, which is demonstrated empirically on CIFAR-10.
Contribution Analysis
Detailed comparisons for each claimed contribution
Implicit Bias Variational Inference (IBVI) method
The authors introduce a method for training variational neural networks by maximizing the expected log-likelihood without explicit KL regularization to the prior. Instead, the method exploits the implicit regularization of SGD to prevent uncertainty collapse and achieve robust generalization.
[34] Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks PDF
[10] Implicitly bayesian prediction rules in deep learning PDF
[28] Stochastic gradient descent as approximate bayesian inference PDF
[29] Subspace inference for Bayesian deep learning PDF
[30] A simple baseline for bayesian uncertainty in deep learning PDF
[31] LLM Unlearning via Loss Adjustment with Only Forget Data PDF
[32] Semi-Implicit Variational Inference PDF
[33] Implicit bias of SGD in L2-regularized linear DNNs: One-way jumps from high to low rank PDF
[35] Neural Operator Variational Inference Based on Regularized Stein Discrepancy for Deep Gaussian Processes PDF
[36] Semi-Implicit Variational Inference via Kernelized Path Gradient Descent PDF
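To make the claimed objective concrete, the loop below is a minimal sketch assuming a mean-field Gaussian variational family over a single weight and the reparameterization trick; the toy model, data, and all names are illustrative, not the authors' code. The key property is that the loss is the Monte Carlo expected squared error alone, with no KL term to a prior anywhere in the update.

```python
import math
import random

random.seed(0)

# Toy regression data: y ~ 2x + noise
xs = [random.uniform(-1.0, 1.0) for _ in range(64)]
data = [(x, 2.0 * x + random.gauss(0.0, 0.1)) for x in xs]

# Mean-field Gaussian variational posterior q(w) = N(mu, sigma^2) over one weight
mu, log_sigma = 0.0, math.log(0.5)
lr, n_samples = 0.05, 8

def grads(mu, log_sigma, batch):
    """Reparameterized gradients of the MC expected squared error (no KL term)."""
    sigma = math.exp(log_sigma)
    g_mu = g_ls = 0.0
    for _ in range(n_samples):
        eps = random.gauss(0.0, 1.0)
        w = mu + sigma * eps                # reparameterization: w = mu + sigma * eps
        for x, y in batch:
            dldw = -2.0 * x * (y - w * x)   # d/dw of the squared error (y - w x)^2
            g_mu += dldw                    # dw/dmu = 1
            g_ls += dldw * sigma * eps      # dw/dlog_sigma = sigma * eps
    n = n_samples * len(batch)
    return g_mu / n, g_ls / n

for step in range(500):
    batch = random.sample(data, 16)
    g_mu, g_ls = grads(mu, log_sigma, batch)
    mu -= lr * g_mu          # plain SGD on the expected log-likelihood only;
    log_sigma -= lr * g_ls   # any regularization is left to SGD's implicit bias
```

Whether the posterior scale collapses or stabilizes under such training is exactly the question the paper attributes to SGD's implicit bias; this toy only demonstrates the shape of the objective being optimized.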
Theoretical characterization of implicit bias as generalized variational inference
The authors prove that for overparametrized linear models, the implicit bias of SGD when training via the expected loss is equivalent to generalized variational inference with a 2-Wasserstein regularizer penalizing deviations from the prior, extending prior results for non-probabilistic models.
[23] On the Optimal Weighted Regularization in Overparameterized Linear Regression PDF
[24] Why do Overparameterized Neural Networks Generalize? PDF
[25] Coresets and Sketches for Regression Problems on Data Streams and Distributed Data PDF
[26] Optimal Implicit Bias in Linear Regression PDF
[27] Computationally Efficient Posterior Inference with Langevin Monte Carlo and Early Stopping PDF
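The theorem extends a classical implicit-bias result for deterministic overparametrized linear models, which the self-contained sketch below illustrates (the specific numbers are arbitrary): gradient descent on an underdetermined least-squares problem converges to the interpolant closest to its initialization in Euclidean norm. For Gaussian variational distributions with matched covariances the 2-Wasserstein distance reduces to exactly this Euclidean distance between means, which is one way to read the paper's regularizer as a generalization of the classical picture; that gloss, and the closed form used below, are standard facts rather than quotes from the paper.

```python
# One linear equation in two unknowns: x1 + 2*x2 = 3 has infinitely many solutions.
X, y = (1.0, 2.0), 3.0
w = [0.5, 0.0]            # initialization w0
w0 = list(w)
lr = 0.05

# Plain gradient descent on the squared residual (Xw - y)^2
for _ in range(2000):
    r = X[0] * w[0] + X[1] * w[1] - y       # residual
    w[0] -= lr * 2.0 * r * X[0]
    w[1] -= lr * 2.0 * r * X[1]

# Closed-form implicit-bias prediction: the interpolant closest to w0,
#   w* = w0 + X^T (X X^T)^{-1} (y - X w0)
gram = X[0] ** 2 + X[1] ** 2
alpha = (y - (X[0] * w0[0] + X[1] * w0[1])) / gram
w_star = [w0[0] + alpha * X[0], w0[1] + alpha * X[1]]
```

Because the gradient always lies in the row space of X, the iterates never leave the affine subspace w0 + span(X), which is why the limit is the minimum-distance interpolant rather than an arbitrary solution.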
Extension of maximal update parametrization to probabilistic networks
The authors extend the maximal update parametrization (μP) to variational neural networks, enabling hyperparameter transfer from small to large models and ensuring feature learning even as network width increases, which is demonstrated empirically on CIFAR-10.
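The width-independence property that μP targets can be sketched as follows. This is a minimal illustration of the standard μP hidden-layer initialization rule (variance scaling like 1/fan-in keeps pre-activations O(1) as width grows); the function names are hypothetical, the learning-rate scaling rules are omitted, and how the authors parametrize the variational scale parameters specifically is not reproduced here.

```python
import math
import random

random.seed(1)

def mup_init_std(fan_in):
    """Hidden-layer init std under muP-style scaling: variance 1/fan_in."""
    return 1.0 / math.sqrt(fan_in)

def preact_rms(width):
    """RMS of pre-activations h = W x for a random square layer of given width."""
    x = [1.0] * width                       # O(1) input features
    std = mup_init_std(width)
    h = []
    for _ in range(width):
        row = [random.gauss(0.0, std) for _ in range(width)]
        h.append(sum(r * xi for r, xi in zip(row, x)))
    return math.sqrt(sum(v * v for v in h) / len(h))

# Pre-activation scale stays O(1) as width grows, so hyperparameters tuned
# at small width remain sensible at large width (the basis of muP transfer).
rms_small, rms_large = preact_rms(128), preact_rms(512)
```

With variance 1/fan-in, each pre-activation is a sum of `width` terms of variance 1/width, so its scale is O(1) regardless of width; the paper's contribution is making the analogous bookkeeping hold for the variational parameters of probabilistic networks.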