On The Surprising Effectiveness of a Single Global Merging in Decentralized Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Decentralized Learning, Model Merging
Abstract:

Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time to improve global generalization, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global generalization. Surprisingly, we uncover that fully connected communication at the final step, implemented as a single global merging, can significantly improve the generalization performance of decentralized learning under severe data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, previously regarded as detrimental noise, as constructive components essential for matching this rate. This work provides promising evidence that decentralized learning can generalize under high data heterogeneity and limited communication, while opening broad new avenues for model merging research. The code will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes concentrating communication budgets in later training stages, culminating in a single global merging step, to improve generalization in decentralized learning under severe data heterogeneity. It resides in the Temporal Communication Scheduling leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of 49 papers across 19 leaf nodes. This leaf focuses specifically on optimal timing and frequency of synchronization events, distinguishing it from asynchronous methods or adaptive topology approaches that populate neighboring branches.

The taxonomy reveals that Temporal Communication Scheduling sits alongside Asynchronous Communication Approaches and Adaptive Peer Selection within the Communication Scheduling and Synchronization Strategies branch. Neighboring branches address orthogonal concerns: Gradient Compression reduces payload sizes, Efficient Aggregation Protocols optimize merging mechanics, and Bandwidth-Constrained Learning tackles resource limits. The paper's emphasis on when to communicate rather than how to compress or which peers to select places it squarely in the temporal scheduling domain, though its global merging strategy shares conceptual overlap with aggregation protocols that coordinate distributed updates.

Among 29 candidates examined, the theoretical convergence analysis matching parallel SGD rates encountered two refutable candidates out of 10 examined, suggesting moderate prior work on this specific theoretical claim. The empirical demonstration of single global merging effectiveness and the theoretical explanation for temporal communication allocation were compared against 10 and 9 candidates, respectively, with zero refutable matches, indicating these contributions appear more novel within the limited search scope. The statistics reflect a focused literature search rather than exhaustive coverage, so unexamined work may exist beyond the top-K semantic matches and citation expansions performed.

Given the sparse Temporal Communication Scheduling leaf and the limited refutation evidence across most contributions, the work appears to occupy a relatively underexplored niche within decentralized learning. The theoretical convergence claim shows some overlap with prior analysis, but the empirical focus on late-stage global merging and the reinterpretation of local model discrepancy as constructive rather than detrimental noise seem less directly addressed in the examined candidates. These impressions are bounded by the 29-paper search scope and may shift with broader literature coverage.

Taxonomy

49 Core-task Taxonomy Papers
3 Claimed Contributions
29 Contribution Candidate Papers Compared
2 Refutable Papers

Research Landscape Overview

Core task: communication scheduling in decentralized learning under limited peer-to-peer communication. The field addresses how distributed agents can collaboratively train models when direct communication is constrained by bandwidth, latency, or energy budgets.

The taxonomy reveals several complementary research directions: Communication Scheduling and Synchronization Strategies explore when and how nodes exchange updates, including temporal scheduling approaches like Global Merging Decentralized[0] and Scheduling Communication Schemes[34]; Communication Efficiency via Model Compression and Sparsification reduces payload sizes through techniques such as Deep Gradient Compression[44] and SparSFA[11]; Parameter Aggregation and Synchronization Mechanisms design protocols for merging distributed updates, exemplified by Turbo Aggregate[15] and Asynchronous Parameter Sharing[13]; Resource-Aware Decentralized Learning optimizes for heterogeneous compute and energy constraints, as seen in Clustered Energy Harvesting[19] and Resource Constrained Edge[25]; while Convergence, Optimization, and Theoretical Analysis provide formal guarantees, and Application-Specific branches tailor methods to domains like vehicular networks (Vehicular Distributed Learning[46]) or IoT (Client Scheduling IoT[6]).

A central tension emerges between synchronous schemes that ensure consistency but suffer from stragglers, and asynchronous or event-driven approaches that improve throughput at the cost of staleness. Works like Asynchronous Communication Acceleration[14] and Snake Learning[4] explore dynamic topologies and flexible timing, while Training Workload Balancing[5] and Selective Multicast Synchronization[3] address load imbalance and targeted communication. Global Merging Decentralized[0] sits within the Temporal Communication Scheduling branch, emphasizing periodic global synchronization phases to balance convergence speed and communication overhead.
Compared to purely asynchronous methods like Asynchronous Communication Learning[31], it likely enforces stricter coordination intervals, and relative to fine-grained scheduling in Scheduling Communication Schemes[34], it may favor coarser global merge events. This positioning reflects a pragmatic middle ground: leveraging temporal structure to reduce redundant transmissions while maintaining sufficient alignment across peers for stable convergence in bandwidth-limited settings.

Claimed Contributions

Empirical demonstration of single global merging effectiveness

The authors empirically demonstrate that performing a single global merging (parameter averaging) at the final training step significantly improves global generalization in decentralized learning, even under extremely limited communication budgets and high data heterogeneity. This finding holds across diverse experimental settings including different datasets, architectures, and optimizers.

10 retrieved papers
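To make the claimed operation concrete, the following is a minimal sketch of a single global merging step as uniform parameter averaging across all workers' final local models. The representation (dicts mapping parameter names to lists of floats) and the function name `global_merge` are illustrative assumptions; real systems would average tensors such as PyTorch state dicts.

```python
# Sketch only: uniform parameter averaging at the final training step.
# Models are dicts mapping parameter names to lists of floats; this is
# an assumed representation, not the paper's implementation.

def global_merge(local_models):
    """Average corresponding parameters across all local models."""
    n = len(local_models)
    merged = {}
    for name in local_models[0]:
        merged[name] = [
            sum(model[name][i] for model in local_models) / n
            for i in range(len(local_models[0][name]))
        ]
    return merged

# Three workers with divergent local parameters after decentralized training.
workers = [
    {"w": [1.0, 2.0]},
    {"w": [3.0, 4.0]},
    {"w": [5.0, 6.0]},
]
print(global_merge(workers))  # {'w': [3.0, 4.0]}
```

The paper's finding is that this one averaging step, applied once at the end, recovers much of the generalization lost to limited communication during training.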
Theoretical convergence analysis matching parallel SGD rate

The authors establish the first theoretical result proving that the globally merged model from decentralized SGD can achieve the same convergence rate as parallel SGD. They reinterpret part of the model discrepancy among local models as constructive components rather than purely detrimental noise, enabling this rate matching.

10 retrieved papers
Can Refute
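For reference, the parallel SGD rate that the merged model is claimed to match can be written as below. This is the standard rate for smooth nonconvex objectives with bounded stochastic gradient variance; the notation (n workers, T steps, noise level σ, averaged iterate x̄ₜ) is an assumption for illustration, not necessarily the paper's exact formulation.

```latex
% Standard convergence rate of parallel (synchronous) SGD on n workers
% for smooth nonconvex objectives. The merged model of decentralized SGD
% is claimed to match this rate.
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl\|\nabla f(\bar{x}_t)\bigr\|^2
\;=\; \mathcal{O}\!\left(\frac{\sigma}{\sqrt{nT}}\right),
\qquad
\bar{x}_t \;=\; \frac{1}{n}\sum_{i=1}^{n} x_t^{(i)}.
\]
```

The linear speedup in n is the key property: matching it means the averaged model behaves as if all n workers had synchronized every step.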
Theoretical explanation for temporal communication allocation

The authors provide theoretical justification showing why minimal but non-zero communication preserves model mergeability throughout training, and formally explain why allocating communication budgets toward later training stages improves performance. This is formalized through conditions on consensus violation and gradient norms.

9 retrieved papers
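The allocation idea above can be sketched as a schedule that places a fixed budget of synchronization rounds in the final portion of training rather than uniformly. The function name `late_stage_schedule` and the `late_fraction` parameter are hypothetical illustrations; the paper's actual schedule is not reproduced here.

```python
# Hedged sketch of a late-stage communication schedule: given a budget of
# B synchronization rounds over T training steps, concentrate them in the
# last `late_fraction` of training, always merging at the final step.
# All names and the 0.2 default are illustrative assumptions.

def late_stage_schedule(total_steps, budget, late_fraction=0.2):
    """Return sorted steps at which peers synchronize, with the whole
    budget spent in the last `late_fraction` of training."""
    window_start = int(total_steps * (1.0 - late_fraction))
    window = total_steps - window_start
    if budget >= window:              # budget covers every late step
        return list(range(window_start, total_steps))
    stride = window / budget
    steps = [window_start + int(i * stride) for i in range(budget)]
    steps[-1] = total_steps - 1       # guarantee a final global merge
    return sorted(set(steps))

print(late_stage_schedule(1000, 4))  # [800, 850, 900, 999]
```

A uniform baseline would spread the same four rounds across all 1000 steps; the paper's claim is that the late-concentrated variant generalizes better under the same budget.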

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Empirical demonstration of single global merging effectiveness

The authors empirically demonstrate that performing a single global merging (parameter averaging) at the final training step significantly improves global generalization in decentralized learning, even under extremely limited communication budgets and high data heterogeneity. This finding holds across diverse experimental settings including different datasets, architectures, and optimizers.

Contribution

Theoretical convergence analysis matching parallel SGD rate

The authors establish the first theoretical result proving that the globally merged model from decentralized SGD can achieve the same convergence rate as parallel SGD. They reinterpret part of the model discrepancy among local models as constructive components rather than purely detrimental noise, enabling this rate matching.

Contribution

Theoretical explanation for temporal communication allocation

The authors provide theoretical justification showing why minimal but non-zero communication preserves model mergeability throughout training, and formally explain why allocating communication budgets toward later training stages improves performance. This is formalized through conditions on consensus violation and gradient norms.