Enough is as good as a feast: A Comprehensive Analysis of How Reinforcement Learning Mitigates Task Conflicts in LLMs
Overview
Overall Novelty Assessment
The paper investigates how training paradigms, specifically supervised fine-tuning versus reinforcement learning, affect model merging effectiveness in large language models. It positions itself within the Parameter-Level Conflict Characterization leaf of the taxonomy, which contains only three papers in total. This leaf focuses on analyzing interference patterns at the weight or neuron level to understand where conflicts originate. The sparse population suggests a relatively underexplored research direction, particularly regarding how training methodology influences mergeability, as opposed to the post-hoc merging techniques themselves.
The taxonomy reveals a field heavily weighted toward merging techniques (the training-free and training-dependent branches contain numerous papers) rather than toward foundational analysis of what makes models mergeable. Papers in neighboring leaves examine representation bias and distribution gaps, while sibling papers such as Localizing Task Information and Spark of Neuron analyze where task knowledge resides and neuron-level activation patterns. This work diverges by examining training-time factors rather than post-training parameter analysis, connecting to the broader Training-Dependent Merging Approaches branch through its focus on how models are prepared for merging.
Among thirty candidates examined across three contributions, none were identified as clearly refuting the work. The systematic comparison of SFT versus RL paradigms examined ten candidates with zero refutable overlaps, as did the three-factor theoretical analysis and the demonstration of reduced task conflicts. This suggests the specific angle, training-paradigm impact on mergeability, has limited direct prior work within the search scope. However, the analysis explicitly notes that this represents a limited literature search via top-K semantic matching, not an exhaustive field survey.
The contribution appears relatively novel within the examined scope, particularly in shifting focus from merging algorithms to training methodology. The sparse Parameter-Level Conflict Characterization leaf and the absence of refuting candidates among the thirty examined papers suggest this training-paradigm perspective fills a gap. However, the limited search scope and the field's rapid evolution mean that a comprehensive novelty assessment would require broader examination beyond semantic similarity matching.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors conduct comprehensive experiments across five representative tasks to systematically compare how models trained with supervised fine-tuning versus reinforcement learning behave when merged. They demonstrate that RL-trained models consistently preserve performance better after merging, regardless of the merging method, RL algorithm, or base model used.
The authors identify and analyze three key mechanisms explaining why RL mitigates task conflicts: on-policy data reduces gradient magnitudes; RL optimization objectives naturally attenuate parameter updates as models converge (the "enough is as good as a feast" principle); and joint optimization over positive and negative examples yields less biased task-specific parameter updates.
Through performance landscape visualization and conflict norm analysis, the authors show that RL-trained models exhibit significantly lower cross-task parameter interference compared to SFT models. They demonstrate that parameter updates from RL are more task-orthogonal and less disruptive when merged, while SFT updates tend to be more entangled across tasks.
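The merging setup being compared can be pictured with standard task arithmetic, where each fine-tuned expert contributes a "task vector" (its parameter delta from the base model) and the merged model adds a scaled sum of those deltas back to the base. The sketch below is a minimal NumPy toy, not the paper's implementation: random vectors stand in for flattened model weights, and the scaling coefficient alpha is an assumed illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for flattened model weights (shapes and values are illustrative).
base = rng.normal(size=1000)                    # pretrained base model
expert_a = base + 0.1 * rng.normal(size=1000)   # fine-tuned on task A
expert_b = base + 0.1 * rng.normal(size=1000)   # fine-tuned on task B

# Task vectors: the parameter deltas induced by fine-tuning.
tau_a = expert_a - base
tau_b = expert_b - base

# Task-arithmetic merge: add scaled task vectors back onto the base weights.
alpha = 0.5
merged = base + alpha * (tau_a + tau_b)
```

Smaller, more orthogonal task vectors (as the paper argues RL produces) make the summed delta less likely to overwrite either task's update.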
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[16] Localizing Task Information for Improved Model Merging and Compression PDF
[21] To See a World in a Spark of Neuron: Disentangling Multi-task Interference for Training-free Model Merging PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Systematic comparison of SFT and RL paradigms for model merging
The authors conduct comprehensive experiments across five representative tasks to systematically compare how models trained with supervised fine-tuning versus reinforcement learning behave when merged. They demonstrate that RL-trained models consistently preserve performance better after merging, regardless of the merging method, RL algorithm, or base model used.
[60] Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs PDF
[61] Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models PDF
[62] Training Language Models to Self-Correct via Reinforcement Learning PDF
[63] On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting PDF
[64] ERank: Fusing Supervised Fine-Tuning and Reinforcement Learning for Effective and Efficient Text Reranking PDF
[65] ReFT: Reasoning with Reinforced Fine-Tuning PDF
[66] Teaching Large Language Models to Reason with Reinforcement Learning PDF
[67] Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning PDF
[68] Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) PDF
[69] RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs PDF
Three-factor theoretical and empirical analysis of RL superiority
The authors identify and analyze three key mechanisms explaining why RL mitigates task conflicts: on-policy data reduces gradient magnitudes; RL optimization objectives naturally attenuate parameter updates as models converge (the "enough is as good as a feast" principle); and joint optimization over positive and negative examples yields less biased task-specific parameter updates.
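The second mechanism, update attenuation as the model converges, follows from the form of the policy gradient itself: for a softmax policy, the gradient of a rewarded sample's log-probability with respect to the logits is (one-hot minus probabilities), which vanishes as the policy concentrates on that sample. The NumPy toy below (three actions, with illustrative logit gaps chosen by me) demonstrates the shrinking update norm; it is a sketch of the general property, not the paper's experiment.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_logit_grad(logits, action, reward=1.0):
    # Gradient of reward * log softmax(logits)[action] w.r.t. the logits:
    # reward * (one-hot(action) - softmax(logits)).
    p = softmax(logits)
    onehot = np.zeros_like(p)
    onehot[action] = 1.0
    return reward * (onehot - p)

# As the policy concentrates probability on the rewarded action
# (larger logit gap), the gradient norm shrinks toward zero.
norms = []
for gap in [0.0, 2.0, 5.0]:
    logits = np.array([gap, 0.0, 0.0])  # action 0 increasingly preferred
    g = reinforce_logit_grad(logits, action=0)
    norms.append(np.linalg.norm(g))
    print(f"logit gap {gap}: grad norm {norms[-1]:.3f}")
```

A supervised cross-entropy loss on off-policy data has the same gradient form, but its targets need not be likely under the current model, so its updates do not attenuate in the same self-limiting way.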
[63] On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting PDF
[70] Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning PDF
[71] A Proximal Policy Optimization with Curiosity Algorithm for Virtual Drone Navigation PDF
[72] Collaborative Target Tracking Algorithm for Multi-Agent Based on MAPPO and BCTD PDF
[73] Constrained Reinforcement Learning Has Zero Duality Gap PDF
[74] A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning PDF
[75] PLATO: Policy Learning Using Adaptive Trajectory Optimization PDF
[76] Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization PDF
[77] Scaling Up Multi-Task Robotic Reinforcement Learning PDF
[78] Molecular Graph Generation with Deep Reinforced Multitask Network and Adversarial Imitation Learning PDF
Demonstration that RL reduces task conflicts in model merging
Through performance landscape visualization and conflict norm analysis, the authors show that RL-trained models exhibit significantly lower cross-task parameter interference compared to SFT models. They demonstrate that parameter updates from RL are more task-orthogonal and less disruptive when merged, while SFT updates tend to be more entangled across tasks.
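One way to operationalize "conflict norm" and task-orthogonality is to compare the cross-task cosine similarity and the norms of the two task vectors. The sketch below is a hypothetical NumPy toy: the shared component and all magnitudes are assumptions chosen purely to illustrate entangled (SFT-like) versus small, near-orthogonal (RL-like) updates, and are not derived from the paper's measurements.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two flattened parameter deltas.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)

# Hypothetical task vectors: the "SFT-like" deltas share a large common
# component (entangled), the "RL-like" deltas are small and independent.
shared = rng.normal(size=2000)
sft_a = shared + 0.5 * rng.normal(size=2000)
sft_b = shared + 0.5 * rng.normal(size=2000)
rl_a = 0.2 * rng.normal(size=2000)
rl_b = 0.2 * rng.normal(size=2000)

# Cross-task cosine similarity as a proxy for interference:
# near zero means task-orthogonal updates, large means entangled updates.
print("SFT cross-task cosine:", cosine(sft_a, sft_b))
print("RL  cross-task cosine:", cosine(rl_a, rl_b))

# Update norms: smaller deltas perturb the merged model less.
print("SFT norms:", np.linalg.norm(sft_a), np.linalg.norm(sft_b))
print("RL  norms:", np.linalg.norm(rl_a), np.linalg.norm(rl_b))
```

Under this proxy, lower cross-task cosine and smaller update norms together predict less destructive interference when the two deltas are summed during merging.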