On The Surprising Effectiveness of a Single Global Merging in Decentralized Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Decentralized Learning, Model Merging
Abstract:

Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time to improve global generalization, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global generalization. Surprisingly, we uncover that fully connected communication at the final step, implemented as a single global merging, can significantly improve the generalization performance of decentralized learning under severe data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, previously regarded as detrimental noise, as constructive components essential for matching this rate. This work provides promising evidence that decentralized learning can generalize under high data heterogeneity and limited communication, while opening broad new avenues for model merging research. The code will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes concentrating communication budgets in later training stages, culminating in a single global merging step, to improve generalization in decentralized learning under severe data heterogeneity. It resides in the Temporal Communication Scheduling leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of 49 papers across 19 leaf nodes. This leaf focuses specifically on optimal timing and frequency of synchronization events, distinguishing it from asynchronous methods or adaptive topology approaches that populate neighboring branches.

The taxonomy reveals that Temporal Communication Scheduling sits alongside Asynchronous Communication Approaches and Adaptive Peer Selection within the Communication Scheduling and Synchronization Strategies branch. Neighboring branches address orthogonal concerns: Gradient Compression reduces payload sizes, Efficient Aggregation Protocols optimize merging mechanics, and Bandwidth-Constrained Learning tackles resource limits. The paper's emphasis on when to communicate rather than how to compress or which peers to select places it squarely in the temporal scheduling domain, though its global merging strategy shares conceptual overlap with aggregation protocols that coordinate distributed updates.

Among 29 candidates examined, the theoretical convergence analysis matching parallel SGD rates encountered two refutable candidates out of 10 examined, suggesting moderate prior work on this specific theoretical claim. The empirical demonstration of single global merging effectiveness and the theoretical explanation for temporal communication allocation were compared against 10 and 9 candidates, respectively, with zero refutable matches, indicating these contributions appear more novel within the limited search scope. The statistics reflect a focused literature search rather than exhaustive coverage, so unexamined work may exist beyond the top-K semantic matches and citation expansions performed.

Given the sparse Temporal Communication Scheduling leaf and the limited refutation evidence across most contributions, the work appears to occupy a relatively underexplored niche within decentralized learning. The theoretical convergence claim shows some overlap with prior analysis, but the empirical focus on late-stage global merging and the reinterpretation of local model discrepancy as constructive rather than detrimental noise seem less directly addressed in the examined candidates. These impressions are bounded by the 29-paper search scope and may shift with broader literature coverage.

Taxonomy

49 Core-task Taxonomy Papers
3 Claimed Contributions
29 Contribution Candidate Papers Compared
2 Refutable Papers

Research Landscape Overview

Core task: communication scheduling in decentralized learning under limited peer-to-peer communication. The field addresses how distributed agents can collaboratively train models when direct communication is constrained by bandwidth, latency, or energy budgets.

The taxonomy reveals several complementary research directions: Communication Scheduling and Synchronization Strategies explore when and how nodes exchange updates, including temporal scheduling approaches like Global Merging Decentralized[0] and Scheduling Communication Schemes[34]; Communication Efficiency via Model Compression and Sparsification reduces payload sizes through techniques such as Deep Gradient Compression[44] and SparSFA[11]; Parameter Aggregation and Synchronization Mechanisms design protocols for merging distributed updates, exemplified by Turbo Aggregate[15] and Asynchronous Parameter Sharing[13]; Resource-Aware Decentralized Learning optimizes for heterogeneous compute and energy constraints, as seen in Clustered Energy Harvesting[19] and Resource Constrained Edge[25]; while Convergence, Optimization, and Theoretical Analysis provide formal guarantees, and Application-Specific branches tailor methods to domains like vehicular networks (Vehicular Distributed Learning[46]) or IoT (Client Scheduling IoT[6]).

A central tension emerges between synchronous schemes that ensure consistency but suffer from stragglers, and asynchronous or event-driven approaches that improve throughput at the cost of staleness. Works like Asynchronous Communication Acceleration[14] and Snake Learning[4] explore dynamic topologies and flexible timing, while Training Workload Balancing[5] and Selective Multicast Synchronization[3] address load imbalance and targeted communication. Global Merging Decentralized[0] sits within the Temporal Communication Scheduling branch, emphasizing periodic global synchronization phases to balance convergence speed and communication overhead.
Compared to purely asynchronous methods like Asynchronous Communication Learning[31], it likely enforces stricter coordination intervals, and relative to fine-grained scheduling in Scheduling Communication Schemes[34], it may favor coarser global merge events. This positioning reflects a pragmatic middle ground: leveraging temporal structure to reduce redundant transmissions while maintaining sufficient alignment across peers for stable convergence in bandwidth-limited settings.

Claimed Contributions

Empirical demonstration of single global merging effectiveness

The authors empirically demonstrate that performing a single global merging (parameter averaging) at the final training step significantly improves global generalization in decentralized learning, even under extremely limited communication budgets and high data heterogeneity. This finding holds across diverse experimental settings including different datasets, architectures, and optimizers.

10 retrieved papers
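To make the claimed operation concrete, the following is a minimal sketch of a single global merging step as uniform parameter averaging across all workers' final local models. The representation (dicts mapping parameter names to lists of floats) and the function name `global_merge` are illustrative assumptions; real systems would average tensors such as PyTorch state dicts.

```python
# Sketch only: uniform parameter averaging at the final training step.
# Models are dicts mapping parameter names to lists of floats; this is
# an assumed representation, not the paper's implementation.

def global_merge(local_models):
    """Average corresponding parameters across all local models."""
    n = len(local_models)
    merged = {}
    for name in local_models[0]:
        merged[name] = [
            sum(model[name][i] for model in local_models) / n
            for i in range(len(local_models[0][name]))
        ]
    return merged

# Three workers with divergent local parameters after decentralized training.
workers = [
    {"w": [1.0, 2.0]},
    {"w": [3.0, 4.0]},
    {"w": [5.0, 6.0]},
]
print(global_merge(workers))  # {'w': [3.0, 4.0]}
```

The paper's finding is that this one averaging step, applied once at the end, recovers much of the generalization lost to limited communication during training.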
Theoretical convergence analysis matching parallel SGD rate

The authors establish the first theoretical result proving that the globally merged model from decentralized SGD can achieve the same convergence rate as parallel SGD. They reinterpret part of the model discrepancy among local models as constructive components rather than purely detrimental noise, enabling this rate matching.

10 retrieved papers
Can Refute
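For reference, the parallel SGD rate that the merged model is claimed to match can be written as below. This is the standard rate for smooth nonconvex objectives with bounded stochastic gradient variance; the notation (n workers, T steps, noise level σ, averaged iterate x̄ₜ) is an assumption for illustration, not necessarily the paper's exact formulation.

```latex
% Standard convergence rate of parallel (synchronous) SGD on n workers
% for smooth nonconvex objectives. The merged model of decentralized SGD
% is claimed to match this rate.
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl\|\nabla f(\bar{x}_t)\bigr\|^2
\;=\; \mathcal{O}\!\left(\frac{\sigma}{\sqrt{nT}}\right),
\qquad
\bar{x}_t \;=\; \frac{1}{n}\sum_{i=1}^{n} x_t^{(i)}.
\]
```

The linear speedup in n is the key property: matching it means the averaged model behaves as if all n workers had synchronized every step.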
Theoretical explanation for temporal communication allocation

The authors provide theoretical justification showing why minimal but non-zero communication preserves model mergeability throughout training, and formally explain why allocating communication budgets toward later training stages improves performance. This is formalized through conditions on consensus violation and gradient norms.

9 retrieved papers
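The allocation idea above can be sketched as a schedule that places a fixed budget of synchronization rounds in the final portion of training rather than uniformly. The function name `late_stage_schedule` and the `late_fraction` parameter are hypothetical illustrations; the paper's actual schedule is not reproduced here.

```python
# Hedged sketch of a late-stage communication schedule: given a budget of
# B synchronization rounds over T training steps, concentrate them in the
# last `late_fraction` of training, always merging at the final step.
# All names and the 0.2 default are illustrative assumptions.

def late_stage_schedule(total_steps, budget, late_fraction=0.2):
    """Return sorted steps at which peers synchronize, with the whole
    budget spent in the last `late_fraction` of training."""
    window_start = int(total_steps * (1.0 - late_fraction))
    window = total_steps - window_start
    if budget >= window:              # budget covers every late step
        return list(range(window_start, total_steps))
    stride = window / budget
    steps = [window_start + int(i * stride) for i in range(budget)]
    steps[-1] = total_steps - 1       # guarantee a final global merge
    return sorted(set(steps))

print(late_stage_schedule(1000, 4))  # [800, 850, 900, 999]
```

A uniform baseline would spread the same four rounds across all 1000 steps; the paper's claim is that the late-concentrated variant generalizes better under the same budget.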

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Empirical demonstration of single global merging effectiveness

The authors empirically demonstrate that performing a single global merging (parameter averaging) at the final training step significantly improves global generalization in decentralized learning, even under extremely limited communication budgets and high data heterogeneity. This finding holds across diverse experimental settings including different datasets, architectures, and optimizers.

Contribution

Theoretical convergence analysis matching parallel SGD rate

The authors establish the first theoretical result proving that the globally merged model from decentralized SGD can achieve the same convergence rate as parallel SGD. They reinterpret part of the model discrepancy among local models as constructive components rather than purely detrimental noise, enabling this rate matching.

Contribution

Theoretical explanation for temporal communication allocation

The authors provide theoretical justification showing why minimal but non-zero communication preserves model mergeability throughout training, and formally explain why allocating communication budgets toward later training stages improves performance. This is formalized through conditions on consensus violation and gradient norms.