Adaptive Regularization for Large-Scale Sparse Feature Embedding Models
Overview
Overall Novelty Assessment
The paper contributes a Rademacher complexity-based theoretical explanation for one-epoch overfitting in sparse embedding models, alongside an adaptive regularization method that constrains embedding norm budgets. It resides in the 'Complexity-Based Theoretical Explanations' leaf under 'Theoretical Analysis and Understanding', sharing this leaf with only one sibling paper. This represents a relatively sparse research direction within the broader taxonomy of 23 papers across the field, suggesting that formal theoretical frameworks for understanding one-epoch overfitting remain underdeveloped compared to empirical or application-focused work.
The taxonomy reveals neighboring work in 'Empirical Characterization of One-Epoch Phenomena' (documenting overfitting behavior without formal theory) and multiple regularization branches including 'Multi-Epoch Training with Data Augmentation' and 'Sparse Regularization and Structured Sparsity'. The paper bridges theoretical understanding and practical mitigation, connecting the sparse theoretical branch to the more populated regularization strategies. Its scope excludes purely empirical characterization or architecture-level solutions, instead grounding regularization design in complexity theory—a boundary that distinguishes it from sibling work in adjacent leaves focused on training tricks or structural innovations.
Among the 30 candidates examined, 10 were compared against each contribution. For the theoretical explanation via Rademacher complexity (Contribution 1) and the constrained-optimization formalization (Contribution 3), the search found zero refutations, suggesting these angles are relatively unexplored within its limited scope. For the adaptive regularization method (Contribution 2), it found one refutable match, indicating that some prior work on adaptive embedding regularization exists. These statistics suggest the theoretical framing is more novel than the regularization technique itself, though the search scale is modest and does not cover the full literature landscape.
Based on the top 30 semantic matches, the work appears to occupy a niche intersection of theory and practice in a field dominated by empirical solutions. The theoretical contributions show no overlap with the examined candidates, while the regularization method has limited prior work. However, the analysis does not capture potential overlaps in the broader machine-learning theory or optimization literature beyond the sparse-embedding domain, leaving open the question of novelty relative to general complexity-based regularization research.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors provide a theoretical framework using Rademacher complexity bounds to explain why models with large-scale sparse categorical features suffer from one-epoch overfitting. They show that unconstrained embedding norm growth leads to looser generalization bounds, particularly affecting sparse features.
The authors introduce an adaptive regularization approach that dynamically adjusts regularization strength for each embedding vector based on its occurrence interval during training. This method allocates smaller norm budgets to low-frequency features while allowing larger budgets for high-frequency features, implemented through modified optimizer update rules.
The authors formalize the balance between training error and generalization as a constrained optimization problem over embedding norm budgets. They derive that optimal regularization multipliers should be inversely proportional to feature sample frequency, providing theoretical justification for their adaptive approach.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[8] Why you don't overfit, and don't need Bayes if you only train for one epoch
Contribution Analysis
Detailed comparisons for each claimed contribution
Theoretical explanation of one-epoch overfitting via Rademacher complexity
The authors provide a theoretical framework using Rademacher complexity bounds to explain why models with large-scale sparse categorical features suffer from one-epoch overfitting. They show that unconstrained embedding norm growth leads to looser generalization bounds, particularly affecting sparse features.
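For context, the mechanism described here can be illustrated with a standard norm-based Rademacher bound (textbook material, not the paper's specific theorem). For a linear class over norm-bounded inputs, with a loss taking values in [0, 1] and probability at least 1 − δ:

```latex
% Linear class F = { x -> <w, x> : ||w||_2 <= B } over inputs ||x||_2 <= R:
\hat{\mathfrak{R}}_S(\mathcal{F}) \;\le\; \frac{B R}{\sqrt{n}} .

% Standard high-probability generalization bound (loss in [0,1],
% probability at least 1 - delta), up to a Lipschitz contraction step:
L(f) \;\le\; \hat{L}_S(f) + 2\,\hat{\mathfrak{R}}_S(\mathcal{F})
      + 3\sqrt{\frac{\ln(2/\delta)}{2n}} .
```

For one-hot sparse categorical inputs, ‖x‖ is fixed, so the bound scales linearly with the embedding-norm budget B: letting norms grow unchecked during the single training epoch directly loosens the guarantee, which is the mechanism the contribution formalizes.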
[34] Risk Bounds and Rademacher Complexity in Batch Reinforcement Learning
[35] Norm-based Generalization Bounds for Compositionally Sparse Neural Networks
[36] Sequence Length Independent Norm-Based Generalization Bounds for Transformers
[37] Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature
[38] A general framework for scalable transductive transfer learning
[39] Huber-norm regularization for linear prediction models
[40] Norm-Based Generalization Bounds for Compositionally Sparse Neural Networks
[41] Lecture 5: Rademacher Complexity
[42] Metric Learning for Categorical and Ambiguous Features: An Adversarial Approach
[43] Improving Learning of Deep Neural Networks through Convexification and Kernel Methods
Adaptive regularization method based on feature occurrence intervals
The authors introduce an adaptive regularization approach that dynamically adjusts regularization strength for each embedding vector based on its occurrence interval during training. This method allocates smaller norm budgets to low-frequency features while allowing larger budgets for high-frequency features, implemented through modified optimizer update rules.
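A minimal sketch of such an interval-scaled update rule is shown below. All names here (`FrequencyAdaptiveEmbedding`, `base_decay`, the linear interval-to-decay mapping) are illustrative assumptions for exposition, not the paper's actual implementation: when a feature id occurs, its embedding row is shrunk in proportion to the number of steps since its last occurrence, so low-frequency features accumulate stronger effective decay, i.e. a smaller norm budget.

```python
import numpy as np


class FrequencyAdaptiveEmbedding:
    """Sketch of interval-based adaptive regularization for sparse embeddings.

    Hypothetical illustration (not the paper's code): decay strength for a
    feature's embedding row scales with its occurrence interval, so rarely
    seen features are shrunk more aggressively per occurrence.
    """

    def __init__(self, num_features, dim, lr=0.1, base_decay=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(scale=0.01, size=(num_features, dim))
        self.last_seen = np.zeros(num_features, dtype=np.int64)
        self.step_count = 0
        self.lr = lr
        self.base_decay = base_decay

    def update(self, feature_ids, grads):
        """One sparse step: decoupled weight decay scaled by occurrence interval."""
        self.step_count += 1
        for fid, g in zip(feature_ids, grads):
            # Steps since this feature last occurred; large for rare features.
            interval = self.step_count - self.last_seen[fid]
            decay = self.base_decay * interval
            # Shrink the row (interval-scaled decay), then apply the gradient.
            self.table[fid] *= max(0.0, 1.0 - self.lr * decay)
            self.table[fid] -= self.lr * g
            self.last_seen[fid] = self.step_count
```

Run with gradients held at zero, a feature seen every step keeps most of its norm, while a feature seen once after a long gap is shrunk harder in its single update, mirroring the smaller norm budget allocated to low-frequency features.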
[49] Difacto: Distributed factorization machines
[44] Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers
[45] LAE-Net: A locally-adaptive embedding network for low-light image enhancement
[46] Graph Convolutional Networks with Adaptive Frequency and Dynamic Node Embedding
[47] Adaptive Graph Embedding with Consistency and Specificity for Domain Adaptation
[48] Efficient vector representation for documents through corruption
[50] Leveraging Explainability, Distribution Learning, and Frequency Regularization in Multimedia Applications
[51] Frequency Embedded Regularization Network for Continuous Music Emotion Recognition
[52] Task-adaptive Pre-training of Language Models with Word Embedding Regularization
[53] Discovering Multi-Frequency Embedding for Visible-Infrared Person Re-identification
Formalization of optimal regularization as constrained optimization problem
The authors formalize the balance between training error and generalization as a constrained optimization problem over embedding norm budgets. They derive that optimal regularization multipliers should be inversely proportional to feature sample frequency, providing theoretical justification for their adaptive approach.
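A minimal Lagrangian sketch of this formalization is given below; the symbols (per-feature norm budgets B_i, sample frequencies ν_i, global scale λ₀, and the quadratic penalty form) are our assumptions for illustration, not notation taken from the paper:

```latex
% Trade training error against a total complexity budget C over
% per-feature embedding-norm budgets B_i (with ||e_i||_2 <= B_i):
\min_{\{B_i\}} \; \hat{L}\bigl(\{B_i\}\bigr)
\quad \text{s.t.} \quad \Phi\bigl(\{B_i\}\bigr) \le C .

% Penalized (Lagrangian) form with per-feature multipliers, where
% nu_i is the sample frequency of feature i and lambda_0 a global scale;
% the stated result corresponds to multipliers inversely proportional
% to frequency:
\mathcal{L} = \hat{L} + \sum_i \lambda_i \,\|e_i\|_2^2 ,
\qquad \lambda_i = \frac{\lambda_0}{\nu_i} .
```

Since the penalty on feature i is applied only when that feature occurs, a per-occurrence multiplier of λ₀/ν_i is equivalent to scaling the decay by the feature's occurrence interval, which connects this formalization to the optimizer-level rule described in Contribution 2.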