Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Adam, implicit bias, separable data, adaptive algorithms, mini-batch
Abstract:

Adam [Kingma & Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with ℓ∞-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and show that its bias can deviate from the full-batch behavior. As an extreme example, we construct datasets on which incremental Adam provably converges to the ℓ₂-max-margin classifier, in contrast to the ℓ∞-max-margin bias of full-batch Adam. For general datasets, we characterize its bias using a proxy algorithm for the β₂ → 1 limit. This proxy maximizes a data-adaptive Mahalanobis-norm margin, whose associated covariance matrix is determined by a data-dependent dual fixed-point formulation. We further present concrete datasets where this bias reduces to the standard ℓ₂- and ℓ∞-max-margin classifiers. As a counterpoint, we prove that Signum [Bernstein et al., 2018] converges to the ℓ∞-max-margin classifier for any batch size. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum's bias remains invariant.
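For readers unfamiliar with the per-sample scheme, a minimal sketch of what "incremental Adam" (one sample per update, fixed cyclic order) looks like on logistic regression may help. The dataset, hyperparameters, and epoch count below are illustrative assumptions, not the paper's construction:

```python
import numpy as np

# Hedged sketch of incremental Adam: plain Adam updates applied one sample
# at a time in a fixed cyclic order, minimizing the logistic loss.
def incremental_adam(X, y, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, epochs=200):
    n, d = X.shape
    w, m, v, t = np.zeros(d), np.zeros(d), np.zeros(d), 0
    for _ in range(epochs):
        for i in range(n):  # one sample per update step ("per-sample" batching)
            t += 1
            # Gradient of log(1 + exp(-y_i <w, x_i>)) at the current iterate.
            g = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
            m = beta1 * m + (1 - beta1) * g      # first-moment estimate
            v = beta2 * v + (1 - beta2) * g**2   # second-moment estimate
            m_hat = m / (1 - beta1**t)           # bias correction
            v_hat = v / (1 - beta2**t)
            w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Tiny linearly separable toy problem: label = sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = incremental_adam(X, y)
print(bool(np.all(y * (X @ w) > 0)))  # the learned direction separates the data
```

The only difference from textbook Adam is the inner loop over individual samples; which direction the iterates converge to in such a run is exactly the question the paper studies.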

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper characterizes the implicit bias of incremental Adam (one sample per step) on linearly separable logistic regression, showing that batch size fundamentally alters the solution geometry. It resides in the 'Adam Variants and Batch-Size Dependence' leaf, which contains only two papers total. This sparse taxonomy leaf suggests the specific question of how Adam's bias varies with batching schemes remains relatively unexplored, positioning the work in a niche but emerging research direction within the broader study of adaptive gradient methods.

The taxonomy reveals two main branches: Adaptive Gradient Methods with Momentum (where this paper sits) and Normalized Gradient Descent Methods. The sibling leaf 'Hyperparameter-Induced Implicit Regularization' examines how optimizer hyperparameters act as regularizers, while the neighboring 'Spectral and Momentum Steepest Descent' branch analyzes normalized methods converging to p-norm margins. The paper's focus on batch-size dependence distinguishes it from hyperparameter-centric analyses and connects to the broader question of how aggregation schemes (per-sample vs. mini-batch) interact with adaptive learning rates to shape solution geometry.

Among nineteen candidates examined, none clearly refute the three main contributions. The epoch-wise approximation framework (nine candidates examined, zero refutable) and the Scaled Rademacher construction proving ℓ₂-max-margin convergence (six candidates, zero refutable) appear novel within this limited search scope. The data-adaptive Mahalanobis-norm characterization via fixed-point formulation (four candidates examined) also shows no direct prior work among the candidates reviewed. The absence of refutable pairs suggests these technical contributions extend beyond what the top-K semantic matches and citation expansion captured, though this does not preclude relevant work outside the search scope.

Based on the limited literature search covering nineteen candidates, the work appears to introduce new theoretical machinery for understanding incremental Adam's bias. The sparse taxonomy leaf and zero refutable pairs indicate novelty within the examined scope, though the small candidate pool and narrow semantic search radius mean potentially relevant work in adjacent optimization theory or stochastic approximation may not have been surfaced.

Taxonomy

Core-task Taxonomy Papers: 3
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: implicit bias of mini-batch Adam on separable data.

The field divides into two main branches. The first, Adaptive Gradient Methods with Momentum, encompasses works that study the implicit regularization of momentum-based optimizers like Adam, particularly how batch size and per-sample versus mini-batch updates influence the solutions found. The second, Normalized Gradient Descent Methods, focuses on algorithms that normalize gradients in various ways, often revealing connections to margin maximization and directional convergence. Both branches ask what solutions emerge when optimization is run to convergence on separable or overparameterized problems; they differ in whether momentum and adaptive learning rates or normalization mechanisms drive the implicit bias.

Within the Adaptive Gradient Methods with Momentum branch, recent work has explored how different formulations of Adam lead to distinct biases. Per-sample Adam Bias [0] investigates the role of batch size by contrasting per-sample updates with standard mini-batch averaging, revealing that the implicit regularization can depend critically on how gradient statistics are aggregated. This line of inquiry sits alongside Adam Duality Theory [3], which examines structural properties of Adam's update rule, and complements broader studies like Implicit Hyperparameter Regularization [2], which consider how optimizer hyperparameters themselves act as implicit regularizers. Meanwhile, Spectral Descent Muon [1] offers a contrasting perspective by analyzing momentum methods through a spectral lens.

Together, these works highlight an open question: how does the interplay of momentum, adaptivity, and batch-level aggregation shape the inductive bias of modern optimizers on separable data?

Claimed Contributions

Epoch-wise approximation of incremental Adam dynamics

The authors develop a theoretical framework showing that incremental Adam's epoch-wise updates can be approximated by a function of only the current iterate, eliminating dependence on full gradient history. This approximation becomes a key analytical tool for studying mini-batch Adam's implicit bias.

9 retrieved papers
Proof of ℓ₂-max-margin convergence on Scaled Rademacher data

The authors construct a family of structured datasets (Scaled Rademacher data) and prove that incremental Adam converges to the ℓ₂-max-margin classifier on these datasets, contrasting sharply with full-batch Adam's ℓ∞-max-margin bias. This demonstrates that mini-batch Adam's implicit bias fundamentally differs from the full-batch regime.

6 retrieved papers
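The ℓ₂-versus-ℓ∞ contrast this contribution draws can be made concrete on a toy example (an assumed single symmetric pair of points, not the paper's Scaled Rademacher construction): on one data direction, gradient descent on the logistic loss tracks the ℓ₂-max-margin direction x/‖x‖, while Signum, a sign-based method, tracks the ℓ∞-max-margin direction sign(x):

```python
import numpy as np

# Toy point x with label +1; the mirrored point (-x, -1) yields the same loss,
# so a single gradient function suffices.
x, y = np.array([3.0, 1.0]), 1.0

def grad(w):
    # Gradient of the logistic loss log(1 + exp(-y <w, x>)) at the single point.
    return -y * x / (1.0 + np.exp(y * (x @ w)))

w_gd = np.zeros(2)      # plain gradient descent
w_signum = np.zeros(2)  # Signum: step along the sign of a momentum buffer
m = np.zeros(2)
for _ in range(2000):
    w_gd -= 0.5 * grad(w_gd)
    m = 0.9 * m + 0.1 * grad(w_signum)
    w_signum -= 0.01 * np.sign(m)

dir_gd = w_gd / np.linalg.norm(w_gd)            # -> x/||x|| ~ (0.949, 0.316)
dir_signum = w_signum / np.linalg.norm(w_signum)  # -> sign(x)/sqrt(2) ~ (0.707, 0.707)
print(np.round(dir_gd, 3), np.round(dir_signum, 3))
```

Here every gradient is parallel to x, so GD stays exactly on the ℓ₂-max-margin ray, while the sign operation collapses the update onto (1, 1), the ℓ∞-max-margin direction; the paper's construction engineers an analogous gap between incremental and full-batch Adam.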
Data-adaptive margin-maximization characterization via fixed-point formulation

The authors introduce a uniform-averaging proxy algorithm for the limit β₂ → 1 and characterize its convergence direction through a novel parametric optimization problem combined with a data-dependent dual fixed-point formulation. This framework reveals that mini-batch Adam's implicit bias is intrinsically data-dependent, reducing to the standard ℓ₂- or ℓ∞-max-margin classifiers on specific datasets.

4 retrieved papers
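The report does not reproduce the paper's exact formulation; as a hedged reading of the description above, a data-adaptive Mahalanobis-norm margin objective has the generic form (with the matrix M determined, per the abstract, by a data-dependent dual fixed point that is not spelled out here):

```latex
% Generic Mahalanobis-margin objective (our reading, not the paper's statement).
\max_{w \neq 0} \; \min_{i} \; \frac{y_i \, \langle w, x_i \rangle}{\| w \|_{M}},
\qquad
\| w \|_{M} := \sqrt{w^{\top} M w}, \quad M \succ 0 .
```

Taking M = I recovers the ℓ₂-max-margin classifier; on specific datasets the maximizer can also coincide with the ℓ∞-max-margin classifier, even though no quadratic norm equals the ℓ∞ norm itself.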

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Epoch-wise approximation of incremental Adam dynamics

Contribution

Proof of ℓ₂-max-margin convergence on Scaled Rademacher data

Contribution

Data-adaptive margin-maximization characterization via fixed-point formulation