Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Adam, implicit bias, separable data, adaptive algorithms, mini-batch
Abstract:

Adam [Kingma & Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with ℓ∞-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and show that its bias can deviate from the full-batch behavior. As an extreme example, we construct datasets on which incremental Adam provably converges to the ℓ₂-max-margin classifier, in contrast to the ℓ∞-max-margin bias of full-batch Adam. For general datasets, we characterize its bias using a proxy algorithm for the β₂ → 1 limit. This proxy maximizes a data-adaptive Mahalanobis-norm margin, whose associated covariance matrix is determined by a data-dependent dual fixed-point formulation. We further present concrete datasets where this bias reduces to the standard ℓ₂- and ℓ∞-max-margin classifiers. As a counterpoint, we prove that Signum [Bernstein et al., 2018] converges to the ℓ∞-max-margin classifier for any batch size. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum's bias remains invariant.
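For readers unfamiliar with the per-sample scheme, a minimal sketch of what "incremental Adam" (one sample per update, fixed cyclic order) looks like on logistic regression may help. The dataset, hyperparameters, and epoch count below are illustrative assumptions, not the paper's construction:

```python
import numpy as np

# Hedged sketch of incremental Adam: plain Adam updates applied one sample
# at a time in a fixed cyclic order, minimizing the logistic loss.
def incremental_adam(X, y, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, epochs=200):
    n, d = X.shape
    w, m, v, t = np.zeros(d), np.zeros(d), np.zeros(d), 0
    for _ in range(epochs):
        for i in range(n):  # one sample per update step ("per-sample" batching)
            t += 1
            # Gradient of log(1 + exp(-y_i <w, x_i>)) at the current iterate.
            g = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
            m = beta1 * m + (1 - beta1) * g      # first-moment estimate
            v = beta2 * v + (1 - beta2) * g**2   # second-moment estimate
            m_hat = m / (1 - beta1**t)           # bias correction
            v_hat = v / (1 - beta2**t)
            w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Tiny linearly separable toy problem: label = sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = incremental_adam(X, y)
print(bool(np.all(y * (X @ w) > 0)))  # the learned direction separates the data
```

The only difference from textbook Adam is the inner loop over individual samples; which direction the iterates converge to in such a run is exactly the question the paper studies.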

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper characterizes the implicit bias of incremental Adam (one sample per step) on linearly separable logistic regression, showing that batch size fundamentally alters the solution geometry. It resides in the 'Adam Variants and Batch-Size Dependence' leaf, which contains only two papers total. This sparse taxonomy leaf suggests the specific question of how Adam's bias varies with batching schemes remains relatively unexplored, positioning the work in a niche but emerging research direction within the broader study of adaptive gradient methods.

The taxonomy reveals two main branches: Adaptive Gradient Methods with Momentum (where this paper sits) and Normalized Gradient Descent Methods. The sibling leaf 'Hyperparameter-Induced Implicit Regularization' examines how optimizer hyperparameters act as regularizers, while the neighboring 'Spectral and Momentum Steepest Descent' branch analyzes normalized methods converging to p-norm margins. The paper's focus on batch-size dependence distinguishes it from hyperparameter-centric analyses and connects to the broader question of how aggregation schemes (per-sample vs. mini-batch) interact with adaptive learning rates to shape solution geometry.

Among nineteen candidates examined, none clearly refute the three main contributions. The epoch-wise approximation framework (nine candidates examined, zero refutable) and the Scaled Rademacher construction proving ℓ₂-max-margin convergence (six candidates, zero refutable) appear novel within this limited search scope. The data-adaptive Mahalanobis-norm characterization via fixed-point formulation (four candidates examined) also shows no direct prior work among the candidates reviewed. The absence of refutable pairs suggests these technical contributions extend beyond what the top-K semantic matches and citation expansion captured, though this does not preclude relevant work outside the search scope.

Based on the limited literature search covering nineteen candidates, the work appears to introduce new theoretical machinery for understanding incremental Adam's bias. The sparse taxonomy leaf and zero refutable pairs indicate novelty within the examined scope, though the small candidate pool and narrow semantic search radius mean potentially relevant work in adjacent optimization theory or stochastic approximation may not have been surfaced.

Taxonomy

Core-task Taxonomy Papers: 3
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: implicit bias of mini-batch Adam on separable data.

The field divides into two main branches. The first, Adaptive Gradient Methods with Momentum, encompasses works that study the implicit regularization of momentum-based optimizers like Adam, particularly how batch size and per-sample versus mini-batch updates influence the solutions found. The second, Normalized Gradient Descent Methods, focuses on algorithms that normalize gradients in various ways, often revealing connections to margin maximization and directional convergence. Both branches ask what solutions emerge when optimization is run to convergence on separable or overparameterized problems; they differ in whether momentum and adaptive learning rates or normalization mechanisms drive the implicit bias.

Within the Adaptive Gradient Methods with Momentum branch, recent work has explored how different formulations of Adam lead to distinct biases. Per-sample Adam Bias [0] investigates the role of batch size by contrasting per-sample updates with standard mini-batch averaging, revealing that the implicit regularization can depend critically on how gradient statistics are aggregated. This line of inquiry sits alongside Adam Duality Theory [3], which examines structural properties of Adam's update rule, and complements broader studies like Implicit Hyperparameter Regularization [2], which consider how optimizer hyperparameters themselves act as implicit regularizers. Meanwhile, Spectral Descent Muon [1] offers a contrasting perspective by analyzing momentum methods through a spectral lens.

Together, these works highlight an open question: how does the interplay of momentum, adaptivity, and batch-level aggregation shape the inductive bias of modern optimizers on separable data?

Claimed Contributions

Epoch-wise approximation of incremental Adam dynamics

The authors develop a theoretical framework showing that incremental Adam's epoch-wise updates can be approximated by a function of only the current iterate, eliminating dependence on full gradient history. This approximation becomes a key analytical tool for studying mini-batch Adam's implicit bias.

9 retrieved papers
Proof of ℓ₂-max-margin convergence on Scaled Rademacher data

The authors construct a family of structured datasets (Scaled Rademacher data) and prove that incremental Adam converges to the ℓ₂-max-margin classifier on these datasets, contrasting sharply with full-batch Adam's ℓ∞-max-margin bias. This demonstrates that mini-batch Adam's implicit bias fundamentally differs from the full-batch regime.

6 retrieved papers
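The ℓ₂-versus-ℓ∞ contrast this contribution draws can be made concrete on a toy example (an assumed single symmetric pair of points, not the paper's Scaled Rademacher construction): on one data direction, gradient descent on the logistic loss tracks the ℓ₂-max-margin direction x/‖x‖, while Signum, a sign-based method, tracks the ℓ∞-max-margin direction sign(x):

```python
import numpy as np

# Toy point x with label +1; the mirrored point (-x, -1) yields the same loss,
# so a single gradient function suffices.
x, y = np.array([3.0, 1.0]), 1.0

def grad(w):
    # Gradient of the logistic loss log(1 + exp(-y <w, x>)) at the single point.
    return -y * x / (1.0 + np.exp(y * (x @ w)))

w_gd = np.zeros(2)      # plain gradient descent
w_signum = np.zeros(2)  # Signum: step along the sign of a momentum buffer
m = np.zeros(2)
for _ in range(2000):
    w_gd -= 0.5 * grad(w_gd)
    m = 0.9 * m + 0.1 * grad(w_signum)
    w_signum -= 0.01 * np.sign(m)

dir_gd = w_gd / np.linalg.norm(w_gd)            # -> x/||x|| ~ (0.949, 0.316)
dir_signum = w_signum / np.linalg.norm(w_signum)  # -> sign(x)/sqrt(2) ~ (0.707, 0.707)
print(np.round(dir_gd, 3), np.round(dir_signum, 3))
```

Here every gradient is parallel to x, so GD stays exactly on the ℓ₂-max-margin ray, while the sign operation collapses the update onto (1, 1), the ℓ∞-max-margin direction; the paper's construction engineers an analogous gap between incremental and full-batch Adam.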
Data-adaptive margin-maximization characterization via fixed-point formulation

The authors introduce a uniform-averaging proxy algorithm for the limit β₂ → 1 and characterize its convergence direction through a novel parametric optimization problem combined with a data-dependent dual fixed-point formulation. This framework reveals that mini-batch Adam's implicit bias is intrinsically data-dependent, reducing to the standard ℓ₂- or ℓ∞-max-margin classifiers on specific datasets.

4 retrieved papers
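The report does not reproduce the paper's exact formulation; as a hedged reading of the description above, a data-adaptive Mahalanobis-norm margin objective has the generic form (with the matrix M determined, per the abstract, by a data-dependent dual fixed point that is not spelled out here):

```latex
% Generic Mahalanobis-margin objective (our reading, not the paper's statement).
\max_{w \neq 0} \; \min_{i} \; \frac{y_i \, \langle w, x_i \rangle}{\| w \|_{M}},
\qquad
\| w \|_{M} := \sqrt{w^{\top} M w}, \quad M \succ 0 .
```

Taking M = I recovers the ℓ₂-max-margin classifier; on specific datasets the maximizer can also coincide with the ℓ∞-max-margin classifier, even though no quadratic norm equals the ℓ∞ norm itself.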

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Epoch-wise approximation of incremental Adam dynamics

Contribution

Proof of ℓ₂-max-margin convergence on Scaled Rademacher data

Contribution

Data-adaptive margin-maximization characterization via fixed-point formulation