Enough is as good as a feast: A Comprehensive Analysis of How Reinforcement Learning Mitigates Task Conflicts in LLMs
Overview
Overall Novelty Assessment
The paper investigates how training paradigms, specifically supervised fine-tuning versus reinforcement learning, affect model merging effectiveness in large language models. It positions itself within the Parameter-Level Conflict Characterization leaf of the taxonomy, which contains only three papers in total. This leaf focuses on analyzing interference patterns at the weight or neuron level to understand where conflicts originate. The sparse population suggests a relatively underexplored research direction, particularly regarding how training methodology influences mergeability, as opposed to the post-hoc merging techniques themselves.
The taxonomy reveals a field heavily weighted toward merging techniques (the training-free and training-dependent branches contain numerous papers) rather than toward foundational analysis of what makes models mergeable. Papers in neighboring leaves examine representation bias and distribution gaps, while sibling papers such as Localizing Task Information and Spark of Neuron analyze where task knowledge resides and neuron-level activation patterns. This work diverges by examining training-time factors rather than post-training parameter analysis, connecting to the broader Training-Dependent Merging Approaches branch through its focus on how models are prepared for merging.
Among thirty candidates examined across three contributions, none were identified as clearly refuting the work. The systematic comparison of SFT versus RL paradigms examined ten candidates with zero refutable overlaps, as did the three-factor theoretical analysis and the demonstration of reduced task conflicts. This suggests the specific angle, training-paradigm impact on mergeability, has limited direct prior work within the search scope. However, the analysis explicitly notes that this represents a limited literature search via top-K semantic matching, not an exhaustive field survey.
The contribution appears relatively novel within the examined scope, particularly in shifting focus from merging algorithms to training methodology. The sparse Parameter-Level Conflict Characterization leaf and the absence of refuting candidates among the thirty examined papers suggest this training-paradigm perspective fills a gap. However, the limited search scope and the field's rapid evolution mean that a comprehensive novelty assessment would require broader examination beyond semantic similarity matching.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors conduct comprehensive experiments across five representative tasks to systematically compare how models trained with supervised fine-tuning versus reinforcement learning behave when merged. They demonstrate that RL-trained models consistently preserve performance better after merging, regardless of the merging method, RL algorithm, or base model used.
The authors identify and analyze three key mechanisms explaining why RL mitigates task conflicts: on-policy data reduces gradient magnitudes; RL optimization objectives naturally attenuate parameter updates as models converge (the "enough is as good as a feast" principle); and joint optimization over positive and negative examples yields less biased task-specific parameter updates.
Through performance landscape visualization and conflict norm analysis, the authors show that RL-trained models exhibit significantly lower cross-task parameter interference compared to SFT models. They demonstrate that parameter updates from RL are more task-orthogonal and less disruptive when merged, while SFT updates tend to be more entangled across tasks.
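The merging setup being compared can be pictured with standard task arithmetic, where each fine-tuned expert contributes a "task vector" (its parameter delta from the base model) and the merged model adds a scaled sum of those deltas back to the base. The sketch below is a minimal NumPy toy, not the paper's implementation: random vectors stand in for flattened model weights, and the scaling coefficient alpha is an assumed illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for flattened model weights (shapes and values are illustrative).
base = rng.normal(size=1000)                    # pretrained base model
expert_a = base + 0.1 * rng.normal(size=1000)   # fine-tuned on task A
expert_b = base + 0.1 * rng.normal(size=1000)   # fine-tuned on task B

# Task vectors: the parameter deltas induced by fine-tuning.
tau_a = expert_a - base
tau_b = expert_b - base

# Task-arithmetic merge: add scaled task vectors back onto the base weights.
alpha = 0.5
merged = base + alpha * (tau_a + tau_b)
```

Smaller, more orthogonal task vectors (as the paper argues RL produces) make the summed delta less likely to overwrite either task's update.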
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[16] Localizing Task Information for Improved Model Merging and Compression PDF
[21] To See a World in a Spark of Neuron: Disentangling Multi-task Interference for Training-free Model Merging PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Systematic comparison of SFT and RL paradigms for model merging
The authors conduct comprehensive experiments across five representative tasks to systematically compare how models trained with supervised fine-tuning versus reinforcement learning behave when merged. They demonstrate that RL-trained models consistently preserve performance better after merging, regardless of the merging method, RL algorithm, or base model used.
[60] Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLMs PDF
[61] Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models PDF
[62] Training Language Models to Self-Correct via Reinforcement Learning PDF
[63] On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting PDF
[64] ERank: Fusing Supervised Fine-Tuning and Reinforcement Learning for Effective and Efficient Text Reranking PDF
[65] ReFT: Reasoning with Reinforced Fine-Tuning PDF
[66] Teaching Large Language Models to Reason with Reinforcement Learning PDF
[67] Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning PDF
[68] Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) PDF
[69] RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs PDF
Three-factor theoretical and empirical analysis of RL superiority
The authors identify and analyze three key mechanisms explaining why RL mitigates task conflicts: on-policy data reduces gradient magnitudes; RL optimization objectives naturally attenuate parameter updates as models converge (the "enough is as good as a feast" principle); and joint optimization over positive and negative examples yields less biased task-specific parameter updates.
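The second mechanism, update attenuation as the model converges, follows from the form of the policy gradient itself: for a softmax policy, the gradient of a rewarded sample's log-probability with respect to the logits is (one-hot minus probabilities), which vanishes as the policy concentrates on that sample. The NumPy toy below (three actions, with illustrative logit gaps chosen by me) demonstrates the shrinking update norm; it is a sketch of the general property, not the paper's experiment.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_logit_grad(logits, action, reward=1.0):
    # Gradient of reward * log softmax(logits)[action] w.r.t. the logits:
    # reward * (one-hot(action) - softmax(logits)).
    p = softmax(logits)
    onehot = np.zeros_like(p)
    onehot[action] = 1.0
    return reward * (onehot - p)

# As the policy concentrates probability on the rewarded action
# (larger logit gap), the gradient norm shrinks toward zero.
norms = []
for gap in [0.0, 2.0, 5.0]:
    logits = np.array([gap, 0.0, 0.0])  # action 0 increasingly preferred
    g = reinforce_logit_grad(logits, action=0)
    norms.append(np.linalg.norm(g))
    print(f"logit gap {gap}: grad norm {norms[-1]:.3f}")
```

A supervised cross-entropy loss on off-policy data has the same gradient form, but its targets need not be likely under the current model, so its updates do not attenuate in the same self-limiting way.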
[63] On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting PDF
[70] Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning PDF
[71] A Proximal Policy Optimization with Curiosity Algorithm for Virtual Drone Navigation PDF
[72] Collaborative Target Tracking Algorithm for Multi-Agent Based on MAPPO and BCTD PDF
[73] Constrained Reinforcement Learning Has Zero Duality Gap PDF
[74] A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning PDF
[75] PLATO: Policy Learning Using Adaptive Trajectory Optimization PDF
[76] Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization PDF
[77] Scaling Up Multi-Task Robotic Reinforcement Learning PDF
[78] Molecular Graph Generation with Deep Reinforced Multitask Network and Adversarial Imitation Learning PDF
Demonstration that RL reduces task conflicts in model merging
Through performance landscape visualization and conflict norm analysis, the authors show that RL-trained models exhibit significantly lower cross-task parameter interference compared to SFT models. They demonstrate that parameter updates from RL are more task-orthogonal and less disruptive when merged, while SFT updates tend to be more entangled across tasks.
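One way to operationalize "conflict norm" and task-orthogonality is to compare the cross-task cosine similarity and the norms of the two task vectors. The sketch below is a hypothetical NumPy toy: the shared component and all magnitudes are assumptions chosen purely to illustrate entangled (SFT-like) versus small, near-orthogonal (RL-like) updates, and are not derived from the paper's measurements.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two flattened parameter deltas.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)

# Hypothetical task vectors: the "SFT-like" deltas share a large common
# component (entangled), the "RL-like" deltas are small and independent.
shared = rng.normal(size=2000)
sft_a = shared + 0.5 * rng.normal(size=2000)
sft_b = shared + 0.5 * rng.normal(size=2000)
rl_a = 0.2 * rng.normal(size=2000)
rl_b = 0.2 * rng.normal(size=2000)

# Cross-task cosine similarity as a proxy for interference:
# near zero means task-orthogonal updates, large means entangled updates.
print("SFT cross-task cosine:", cosine(sft_a, sft_b))
print("RL  cross-task cosine:", cosine(rl_a, rl_b))

# Update norms: smaller deltas perturb the merged model less.
print("SFT norms:", np.linalg.norm(sft_a), np.linalg.norm(sft_b))
print("RL  norms:", np.linalg.norm(rl_a), np.linalg.norm(rl_b))
```

Under this proxy, lower cross-task cosine and smaller update norms together predict less destructive interference when the two deltas are summed during merging.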