On The Surprising Effectiveness of a Single Global Merging in Decentralized Learning
Overview
Overall Novelty Assessment
The paper proposes concentrating communication budgets in later training stages, culminating in a single global merging step, to improve generalization in decentralized learning under severe data heterogeneity. It resides in the Temporal Communication Scheduling leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of 49 papers across 19 leaf nodes. This leaf focuses specifically on optimal timing and frequency of synchronization events, distinguishing it from asynchronous methods or adaptive topology approaches that populate neighboring branches.
The taxonomy reveals that Temporal Communication Scheduling sits alongside Asynchronous Communication Approaches and Adaptive Peer Selection within the Communication Scheduling and Synchronization Strategies branch. Neighboring branches address orthogonal concerns: Gradient Compression reduces payload sizes, Efficient Aggregation Protocols optimize merging mechanics, and Bandwidth-Constrained Learning tackles resource limits. The paper's emphasis on when to communicate rather than how to compress or which peers to select places it squarely in the temporal scheduling domain, though its global merging strategy shares conceptual overlap with aggregation protocols that coordinate distributed updates.
Of the 29 candidates examined in total, the theoretical convergence claim (matching parallel SGD rates) encountered two refutable candidates among its 10, suggesting moderate prior work on this specific theoretical result. The empirical demonstration of single global merging effectiveness and the theoretical explanation for temporal communication allocation drew 10 and 9 candidates respectively, with zero refutable matches, indicating these contributions appear more novel within the limited search scope. These statistics reflect a focused rather than exhaustive literature search, so unexamined work may exist beyond the top-K semantic matches and citation expansions performed.
Given the sparse Temporal Communication Scheduling leaf and the limited refutation evidence across most contributions, the work appears to occupy a relatively underexplored niche within decentralized learning. The theoretical convergence claim shows some overlap with prior analysis, but the empirical focus on late-stage global merging and the reinterpretation of local model discrepancy as constructive rather than detrimental noise seem less directly addressed in the examined candidates. These impressions are bounded by the 29-paper search scope and may shift with broader literature coverage.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors empirically demonstrate that performing a single global merging (parameter averaging) at the final training step significantly improves global generalization in decentralized learning, even under extremely limited communication budgets and high data heterogeneity. This finding holds across diverse experimental settings including different datasets, architectures, and optimizers.
The authors establish the first theoretical result proving that the globally merged model from decentralized SGD can achieve the same convergence rate as parallel SGD. They reinterpret part of the model discrepancy among local models as constructive components rather than purely detrimental noise, enabling this rate matching.
The authors provide theoretical justification showing why minimal but non-zero communication preserves model mergeability throughout training, and formally explain why allocating communication budgets toward later training stages improves performance. This is formalized through conditions on consensus violation and gradient norms.
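To make the first contribution concrete, the following is a minimal, purely illustrative sketch of "single global merging": workers train entirely locally on heterogeneous objectives and their parameters are uniformly averaged only once, at the final step. The toy quadratic objectives, worker count, and helper names (`local_sgd_step`, `global_merge`) are assumptions for illustration, not the paper's actual experimental setup.

```python
import random

def local_sgd_step(param, target, lr=0.1):
    # Gradient of the local quadratic 0.5 * (param - target)^2 is (param - target).
    return param - lr * (param - target)

def global_merge(params):
    # Single global merging: uniform parameter averaging across all workers.
    return sum(params) / len(params)

random.seed(0)
targets = [random.gauss(0.0, 1.0) for _ in range(4)]  # heterogeneous local optima (non-IID proxy)
workers = [0.0] * 4                                   # shared initialization

for _ in range(100):                                  # purely local training, no communication
    workers = [local_sgd_step(w, t) for w, t in zip(workers, targets)]

merged = global_merge(workers)  # one-shot merge at the final training step
print(round(merged, 4))
```

In this toy setting each worker converges to its own optimum, and the one-shot average lands near the mean of the per-worker optima; the paper's claim is that an analogous final averaging recovers strong global generalization for deep networks, which this scalar sketch only caricatures.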
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Empirical demonstration of single global merging effectiveness
The authors empirically demonstrate that performing a single global merging (parameter averaging) at the final training step significantly improves global generalization in decentralized learning, even under extremely limited communication budgets and high data heterogeneity. This finding holds across diverse experimental settings including different datasets, architectures, and optimizers.
[50] One-shot federated learning for LEO constellations that reduces convergence time from days to 90 minutes PDF
[51] Communication-Efficient Distributed Deep Learning via Federated Dynamic Averaging PDF
[52] K-DUMBs IoRT: Knowledge driven unified model block sharing in the Internet of Robotic Things PDF
[53] Multi-Device Cooperative Fine-Tuning of Foundation Models at the Network Edge PDF
[54] OSGAN: One-shot distributed learning using generative adversarial networks PDF
[55] Optimizing quantum federated learning: addressing non-IID data challenges with global data sharing in weighted model averaging and clustering-based parameter selection PDF
[56] One-shot federated learning-based model-free reinforcement learning PDF
[57] MOHFL: Multi-Level One-Shot Hierarchical Federated Learning With Enhanced Model Aggregation Over Non-IID Data PDF
[58] Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices PDF
[59] DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models PDF
Theoretical convergence analysis matching parallel SGD rate
The authors establish the first theoretical result proving that the globally merged model from decentralized SGD can achieve the same convergence rate as parallel SGD. They reinterpret part of the model discrepancy among local models as constructive components rather than purely detrimental noise, enabling this rate matching.
[60] Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent PDF
[69] Asymptotic network independence in distributed stochastic optimization for machine learning PDF
[61] CEDAS: A Compressed Decentralized Stochastic Gradient Method With Improved Convergence PDF
[62] Does worst-performing agent lead the pack? Analyzing agent dynamics in unified distributed SGD PDF
[63] A(DP)SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent With Differential Privacy PDF
[64] An improved convergence analysis for decentralized online stochastic non-convex optimization PDF
[65] Decentralized asynchronous nonconvex stochastic optimization on directed graphs PDF
[66] Tackling Data Heterogeneity: A New Unified Framework for Decentralized SGD with Sample-induced Topology PDF
[67] Improving the transient times for distributed stochastic gradient methods PDF
[68] D-(DP)2SGD: Decentralized Parallel SGD with Differential Privacy in Dynamic Networks PDF
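For readers unfamiliar with the benchmark, "matching the parallel SGD rate" is usually read against the standard nonconvex bound below; the symbols here ($f$ for the global objective, $\bar{x}_t$ for the merged/averaged iterate, $n$ workers, $T$ steps) are generic notation and need not match the paper's exact theorem statement.

```latex
% Standard nonconvex convergence rate of parallel (synchronous) SGD with
% n workers and T steps; the paper claims the globally merged model from
% decentralized SGD attains a bound of the same order.
\frac{1}{T}\sum_{t=1}^{T}
  \mathbb{E}\!\left[\bigl\|\nabla f(\bar{x}_t)\bigr\|^{2}\right]
  \;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{nT}}\right)
```

The linear speedup in $n$ is the key property: decentralized methods typically pay an extra consensus-error term, and the paper's reinterpretation of model discrepancy as partly constructive is what lets that term be absorbed rather than degrade the rate.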
Theoretical explanation for temporal communication allocation
The authors provide theoretical justification showing why minimal but non-zero communication preserves model mergeability throughout training, and formally explain why allocating communication budgets toward later training stages improves performance. This is formalized through conditions on consensus violation and gradient norms.
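The allocation idea above can be sketched as a back-loaded communication schedule: a fixed budget of synchronization rounds placed increasingly densely toward the end of training, with the final step always reserved for the global merging. The function name, the power-law placement rule, and the default `power=3.0` are illustrative assumptions, not the schedule derived in the paper.

```python
def back_loaded_schedule(total_steps, budget, power=3.0):
    """Return sorted step indices at which workers synchronize.

    Each of the `budget` rounds is placed at total_steps * (k/budget)**(1/power),
    so consecutive rounds grow denser toward the end of training; larger `power`
    pushes them later. The final step is always included and plays the role of
    the single global merging.
    """
    return sorted({max(1, round(total_steps * (k / budget) ** (1.0 / power)))
                   for k in range(1, budget + 1)})

sched = back_loaded_schedule(total_steps=1000, budget=5)
print(sched)  # [585, 737, 843, 928, 1000] -- gaps shrink late in training
```

Under the paper's conditions on consensus violation and gradient norms, sparse early communication suffices to keep local models mergeable, while concentrating the remaining budget late (as this schedule does) is where the generalization benefit is claimed to arise.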