ON THE ROLE OF IMPLICIT REGULARIZATION OF STOCHASTIC GRADIENT DESCENT IN GROUP ROBUSTNESS

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Spurious Correlations, Stochastic Gradient Descent (SGD), Implicit Regularization
Abstract:

Training with stochastic gradient descent (SGD) at moderately large learning rates has been observed to improve robustness against spurious correlations, i.e., strong correlations between non-predictive features and target labels. Yet the mechanism underlying this effect remains unclear. In this work, we identify batch size as an additional critical factor and show that robustness gains arise from the implicit regularization of SGD, which intensifies with larger learning rates and smaller batch sizes. This implicit regularization reduces reliance on spurious or shortcut features, thereby enhancing robustness while preserving accuracy. Importantly, the effect appears unique to SGD: gradient descent (GD) does not confer the same benefit and may even exacerbate shortcut reliance. Theoretically, we establish this phenomenon in linear models by leveraging statistical formulations of spurious correlations, proving that SGD systematically suppresses spurious-feature dependence. Empirically, we demonstrate that the effect extends to deep neural networks across multiple benchmarks. Code and experiments are available in this \href{https://github.com/ICLR2026-submission/implicit-regularization-in-group-robustness}{GitHub repository}.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how SGD's implicit regularization affects robustness to spurious correlations, identifying batch size and learning rate as critical factors. It resides in the 'Core SGD Implicit Regularization in Group Robustness' leaf under 'Empirical Studies and Benchmarking', a leaf that contains only this paper. This positioning suggests a relatively sparse research direction focused specifically on empirical validation of SGD's implicit bias effects on group-structured robustness, rather than the broader theoretical foundations or mitigation strategies covered in neighboring branches.

The taxonomy reveals substantial activity in related areas: 'Simplicity Bias Mechanisms and Theoretical Foundations' contains multiple papers analyzing gradient dynamics and implicit regularization theory, while 'Optimizer Comparisons' explores alternatives like SAM and Adam. The paper's leaf sits adjacent to 'Simplified Models and Theoretical Testbeds' and 'Workshops and Broad Surveys', indicating it bridges empirical benchmarking with theoretical insights. Unlike purely theoretical work in sibling branches or domain-specific applications in graph learning or reinforcement learning, this work emphasizes controlled empirical validation of core SGD properties across standard benchmarks.

Among the twenty-one candidates examined in total, one was judged able to refute the batch size contribution, suggesting some prior recognition of batch size effects on robustness. For the theoretical characterization of SGD versus GD effects, ten candidates were examined and none clearly refuted the contribution, indicating potential novelty in contrasting these optimizers' impacts on spurious features. The empirical validation across benchmarks was likewise compared against ten candidates without clear refutation, though the limited search scope means relevant prior work may exist beyond the top-K semantic matches analyzed here.

Based on the limited literature search, the work appears to occupy a moderately explored niche. The batch size insight has some precedent, while the theoretical and empirical contributions show less direct overlap among examined candidates. The sparse population of its taxonomy leaf and the focused scope of related work suggest incremental advancement rather than paradigm shift, though the analysis covers only a subset of potentially relevant literature.

Taxonomy

Core-task Taxonomy Papers: 32
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 1

Research Landscape Overview

Core task: implicit regularization effects on spurious feature learning in stochastic gradient descent. The field examines how SGD's inherent biases shape which features neural networks learn, particularly when spurious correlations are present.

The taxonomy reveals several complementary perspectives: theoretical foundations explore simplicity bias mechanisms that explain why SGD favors certain features over others (e.g., Gradient starvation[1], The pitfalls of simplicity[3]); empirical studies and benchmarking efforts document these phenomena across diverse settings; spurious correlation detection work identifies when models rely on shortcuts; optimizer comparisons contrast SGD with alternatives like SAM or mirror descent; noise and stochasticity analyses probe how batch size and gradient noise influence feature selection; mitigation strategies propose interventions to counteract harmful biases; and domain-specific applications test robustness in real-world contexts. These branches collectively map how algorithmic choices, data properties, and training dynamics interact to determine whether networks learn robust or spurious patterns.

Recent work has intensified focus on understanding the temporal dynamics of feature learning and developing early detection methods. Studies like Simplicity bias leads to[5] and Evading the simplicity bias[8] investigate how simplicity preferences emerge during training, while Identifying spurious biases early[6] and Feature Contamination[10] explore detection mechanisms. ON THE ROLE OF[0] sits within the empirical benchmarking cluster examining core SGD implicit regularization effects on group robustness, closely aligned with works like Implicit Regularization Effects of[4] and The implicit bias of[12] that dissect how optimization dynamics affect worst-group performance.
Compared to purely theoretical analyses such as Simplicity Bias via Global[7], ON THE ROLE OF[0] emphasizes empirical validation of how SGD's implicit biases manifest in group-structured data, bridging mechanistic understanding with practical robustness concerns that motivate mitigation strategies like DR3[15] and DIVE[13].

Claimed Contributions

Identification of batch size as critical factor for group robustness via implicit regularization

The authors identify batch size, alongside learning rate, as a key factor influencing group robustness. They demonstrate that SGD's implicit regularization, which strengthens with larger learning rates and smaller batch sizes, reduces reliance on spurious features and enhances robustness while maintaining accuracy.

1 retrieved paper (can refute)
Theoretical characterization of SGD and GD effects on spurious feature reliance in linear models

The authors provide theoretical analysis in linear models showing that SGD systematically suppresses dependence on spurious features through its implicit regularization mechanism, while GD does not confer the same benefit and may even increase shortcut reliance.

10 retrieved papers
Empirical validation on deep neural networks across multiple benchmarks

The authors empirically validate their theoretical findings by demonstrating that the robustness-enhancing effects of SGD's implicit regularization extend beyond linear models to deep neural networks across various benchmark datasets.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification of batch size as critical factor for group robustness via implicit regularization

The authors identify batch size, alongside learning rate, as a key factor influencing group robustness. They demonstrate that SGD's implicit regularization, which strengthens with larger learning rates and smaller batch sizes, reduces reliance on spurious features and enhances robustness while maintaining accuracy.
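To make the batch-size claim concrete, the toy sketch below (our own illustration, not the paper's code; all names are hypothetical) computes the penalty term that backward-error analyses associate with SGD, namely (lr/4) times the mean squared norm of per-batch gradients, on a synthetic linear problem with one core and one spurious feature:

```python
# Toy illustration: SGD's implicit penalty grows as batches shrink.
import numpy as np

rng = np.random.default_rng(0)

n, lr = 32, 0.1
y = rng.choice([-1.0, 1.0], size=n)
x_core = y + 0.5 * rng.normal(size=n)                          # truly predictive feature
x_spur = y * rng.choice([1.0, -1.0], size=n, p=[0.95, 0.05])   # spurious shortcut
X = np.stack([x_core, x_spur], axis=1)
w = np.zeros(2)

def per_sample_grads(w):
    # squared loss 0.5*(x.w - y)^2  ->  per-sample grad = (x.w - y) * x
    resid = X @ w - y
    return resid[:, None] * X                                  # shape (n, 2)

def implicit_penalty(w, batch_size):
    # (lr/4) * mean over batches of ||batch-mean gradient||^2
    g = per_sample_grads(w)
    batch_means = g.reshape(n // batch_size, batch_size, -1).mean(axis=1)
    return lr / 4 * (np.linalg.norm(batch_means, axis=1) ** 2).mean()

for b in (1, 4, n):
    print(f"batch size {b:2d}: implicit penalty {implicit_penalty(w, b):.4f}")
```

By Jensen's inequality the penalty can only grow as batches shrink (batch size n recovers the full-batch gradient norm), matching the claim that smaller batch sizes intensify the implicit regularization.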

Contribution

Theoretical characterization of SGD and GD effects on spurious feature reliance in linear models

The authors provide theoretical analysis in linear models showing that SGD systematically suppresses dependence on spurious features through its implicit regularization mechanism, while GD does not confer the same benefit and may even increase shortcut reliance.
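One common way to make the SGD-versus-GD contrast precise, borrowed from backward-error analyses of SGD (a sketch of a standard formulation, not necessarily the paper's own), compares the modified losses the two optimizers implicitly minimize at learning rate $\eta$:

```latex
% Backward-error-analysis view: modified losses at learning rate \eta,
% where L_k is the loss on mini-batch k of m batches.
\[
\tilde{L}_{\mathrm{GD}}(w) = L(w) + \frac{\eta}{4}\,\lVert \nabla L(w) \rVert^2,
\qquad
\tilde{L}_{\mathrm{SGD}}(w) = L(w) + \frac{\eta}{4m}\sum_{k=1}^{m}\lVert \nabla L_k(w) \rVert^2 .
\]
```

Their difference, $\frac{\eta}{4}\bigl(\frac{1}{m}\sum_{k}\lVert\nabla L_k\rVert^2 - \lVert\nabla L\rVert^2\bigr) \ge 0$, is a gradient-variance term that grows with the learning rate and as the batch size shrinks, which offers one reading of why SGD, but not GD, can penalize features whose per-batch gradients fluctuate, plausibly including spurious features whose usefulness varies across groups.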

Contribution

Empirical validation on deep neural networks across multiple benchmarks

The authors empirically validate their theoretical findings by demonstrating that the robustness-enhancing effects of SGD's implicit regularization extend beyond linear models to deep neural networks across various benchmark datasets.