OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Model Merging, Task Vector, Data-Free Optimization
Abstract:

Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs), which extend LLMs through large-scale multimodal training, have gained traction. However, no benchmark for model merging research exists that cleanly separates MLLM training tasks from evaluation tasks. In this paper, (i) we introduce a model merging benchmark for MLLMs that covers multiple tasks, including VQA, Geometry, Chart, OCR, and Grounding, and studies both LoRA and fully fine-tuned models. We also explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward an Omni-language model. (ii) We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector via a loss defined over task vector interactions, achieving an average performance gain of 2.48%. (iii) We find that model merging offers a promising, training-data-free way to build improved MLLMs. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a benchmark for merging multimodal large language models across tasks like VQA, geometry, chart understanding, OCR, and grounding, alongside a novel OptMerge method for noise-reduced task vector optimization. It resides in the Homogeneous Model Merging leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader Model Merging Techniques and Optimization branch. This positioning suggests the work addresses a less crowded niche focused on parameter-level fusion for models with identical architectures, rather than heterogeneous or temporal merging scenarios.

The taxonomy reveals neighboring leaves addressing complementary challenges: Heterogeneous Architecture Merging tackles parameter asymmetry across distinct structures, Temporal Model Merging handles progressive integration over time, and Neuron-Level Parameter Fusion operates at finer granularity. The paper's exploration of cross-modality merging (vision-language, audio-language, video-language toward omni-language models) connects to the Cross-Modal Integration branch, particularly Audio-Visual Integration and Vision-Language Grounding, though these branches focus on alignment rather than parameter merging. The Training-Free Integration Strategies branch offers an alternative paradigm through orchestration and composition, contrasting with the paper's parameter-fusion approach.

Among twenty-one candidates examined, the benchmark contribution (Contribution A) showed no clear refutation across ten candidates, suggesting limited prior work on MLLM-specific merging benchmarks within the search scope. The OptMerge method (Contribution B) examined only one candidate without refutation, indicating sparse direct overlap in optimization-based task vector techniques. However, the modality merging exploration (Contribution C) encountered one refutable candidate among ten examined, pointing to some existing work on combining diverse modalities. These statistics reflect a targeted semantic search rather than exhaustive coverage, leaving open the possibility of additional relevant literature beyond the top-K matches.

Given the limited search scope and the sparse Homogeneous Model Merging leaf, the work appears to occupy a relatively underexplored intersection of MLLM benchmarking and parameter-level fusion. The single refutation for modality merging suggests partial overlap with prior cross-modal integration efforts, while the benchmark and OptMerge contributions show less direct precedent among the examined candidates. A broader literature review would be needed to confirm whether these impressions hold across the full research landscape.

Taxonomy

47 Core-task Taxonomy Papers
3 Claimed Contributions
21 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: merging multimodal large language models for unified capabilities.

The field has organized itself around several complementary branches that address different facets of combining specialized models into cohesive systems. Model Merging Techniques and Optimization focuses on algorithmic strategies for blending parameters or outputs from homogeneous or heterogeneous models, often exploring weight-space interpolation and optimization-driven fusion methods such as OptMerge[0] and AdaMMS[3]. Training-Free Integration Strategies emphasizes approaches that avoid costly retraining, leveraging orchestration mechanisms like Training-Free Orchestration[2] and dynamic routing schemes. Multimodal Alignment and Fusion Architectures tackles the challenge of aligning representations across vision, language, and other modalities, with works such as AnyGPT[6] and Cephalo[7] proposing unified tokenization or cross-modal projection layers. Catastrophic Forgetting Mitigation and Cross-Modal Integration address stability and coherence when merging diverse capabilities, while Domain-Specific Applications and Surveys provide contextualized use cases and comprehensive overviews like Model Merging Survey[4].

A particularly active line of work contrasts parameter-level merging, where models are combined by averaging or optimizing weights, with inference-time fusion strategies that coordinate outputs from multiple specialists without modifying their parameters. OptMerge[0] exemplifies the former, proposing optimization-based techniques to merge homogeneous models while preserving task-specific strengths. This approach sits closely alongside AdaMMS[3], which also targets adaptive merging of model parameters, and contrasts with training-free methods like Training-Free Orchestration[2] that dynamically select or blend model predictions at runtime.
A recurring theme across these branches is the trade-off between computational efficiency and the depth of integration: parameter merging can yield compact unified models but may struggle with heterogeneous architectures, whereas training-free orchestration offers flexibility at the cost of higher inference overhead. Open questions remain around scalability to many modalities, robustness under distribution shift, and the extent to which merged models can retain fine-grained specialist knowledge without catastrophic interference.
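The parameter-level merging discussed above can be made concrete with a minimal sketch of task-vector averaging in the spirit of task arithmetic. This is an illustrative toy, not the method of any paper cited here; the function name and the `alpha` scaling coefficient are assumptions.

```python
import numpy as np

def merge_by_task_arithmetic(base, experts, alpha=0.5):
    """Merge expert models into a base model by averaging task vectors.

    base:    dict of parameter name -> np.ndarray (pretrained weights)
    experts: list of dicts with the same keys (fine-tuned expert weights)
    alpha:   scaling coefficient applied to the averaged task vector
    """
    merged = {}
    for name, w0 in base.items():
        # A task vector is the expert's weights minus the pretrained weights.
        task_vectors = [e[name] - w0 for e in experts]
        merged[name] = w0 + alpha * np.mean(task_vectors, axis=0)
    return merged
```

The compactness noted above follows directly: the merged model has exactly the base model's parameter count, whereas orchestration must keep every expert resident at inference time.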

Claimed Contributions

Model merging benchmark for multimodal LLMs

The authors present the first benchmark specifically designed for evaluating model merging methods on multimodal large language models. It provides fine-grained categorization of MLLM capabilities across five tasks and includes both LoRA and full fine-tuning scenarios, along with publicly released weights and code.

10 retrieved papers
OptMerge method for robust task vector optimization

The authors introduce OptMerge, a new model merging algorithm that applies low-rank approximations to eliminate redundant noise from task vectors and optimizes the merged vector through a loss function defined over task vector interactions, achieving improved performance over existing methods.

1 retrieved paper
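As a rough illustration of the low-rank denoising idea described for this contribution: the paper's exact rank selection, its loss over task vector interactions, and its optimization loop are not reproduced here, so everything below is an assumption-laden sketch of SVD truncation followed by a weighted merge.

```python
import numpy as np

def denoise_task_vector(tv, rank):
    """Low-rank (SVD) approximation of a 2-D task vector.

    Keeping only the top-`rank` singular directions discards small,
    presumably noisy components of the fine-tuning delta.
    """
    u, s, vt = np.linalg.svd(tv, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def merge_denoised(base, task_vectors, rank=8, coeffs=None):
    """Add a weighted sum of denoised task vectors to the base weights."""
    if coeffs is None:
        coeffs = [1.0 / len(task_vectors)] * len(task_vectors)
    delta = sum(c * denoise_task_vector(tv, rank)
                for c, tv in zip(coeffs, task_vectors))
    return base + delta
```

In OptMerge itself the merging coefficients are reportedly optimized against an interaction-based loss rather than fixed, as in this sketch.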
Exploration of modality merging toward omni-language models

The authors investigate using model merging to integrate different modality encoders (vision, audio, video) into a unified language model without requiring training data, demonstrating that complementarity among multiple modalities outperforms individual modalities and offering a data-free approach to building omni-modal systems.

10 retrieved papers (1 refutable)
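The data-free modality merge described for this contribution can be sketched as follows. The flat dict-of-parameters layout and the rule for handling modality-specific modules are hypothetical simplifications, not the paper's stated procedure; it assumes every expert contains the full shared backbone.

```python
def merge_modalities(llm_base, modality_experts):
    """Data-free merge of modality experts that share an LLM backbone.

    Parameters present in the base (the shared backbone) are merged by
    averaging task vectors; parameters unique to one expert (e.g. an
    audio encoder) are carried over unchanged.
    """
    merged = {}
    for name, w0 in llm_base.items():
        # Average the fine-tuning deltas from every expert that has this
        # backbone parameter.
        deltas = [e[name] - w0 for e in modality_experts if name in e]
        merged[name] = w0 + sum(deltas) / len(deltas)
    for expert in modality_experts:
        for name, w in expert.items():
            if name not in llm_base:
                merged.setdefault(name, w)  # keep modality-specific modules
    return merged
```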

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: Model merging benchmark for multimodal LLMs
Contribution B: OptMerge method for robust task vector optimization
Contribution C: Exploration of modality merging toward omni-language models