OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.0 Download Report PDF

Model MergingTask VectorData-Free Optimization

Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage and serving costs while supporting decentralized development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Recently, Multimodal LLMs (MLLMs) that extend LLMs through large-scale multimodal training have gained traction. However, no benchmark exists for model merging research that clearly divides the tasks of MLLM training and evaluation. In this paper, $(i)$ we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, studying both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward the Omni-language model. $(ii)$ We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. $(iii)$ We find that model merging offers a promising way for building improved MLLMs without requiring training data. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a benchmark for merging multimodal large language models across tasks like VQA, geometry, chart understanding, OCR, and grounding, alongside a novel OptMerge method for noise-reduced task vector optimization. It resides in the Homogeneous Model Merging leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader Model Merging Techniques and Optimization branch. This positioning suggests the work addresses a less crowded niche focused on parameter-level fusion for models with identical architectures, rather than heterogeneous or temporal merging scenarios.

The taxonomy reveals neighboring leaves addressing complementary challenges: Heterogeneous Architecture Merging tackles parameter asymmetry across distinct structures, Temporal Model Merging handles progressive integration over time, and Neuron-Level Parameter Fusion operates at finer granularity. The paper's exploration of cross-modality merging (vision-language, audio-language, video-language toward omni-language models) connects to the Cross-Modal Integration branch, particularly Audio-Visual Integration and Vision-Language Grounding, though these branches focus on alignment rather than parameter merging. The Training-Free Integration Strategies branch offers an alternative paradigm through orchestration and composition, contrasting with the paper's parameter-fusion approach.

Among twenty-one candidates examined, the benchmark contribution (Contribution A) showed no clear refutation across ten candidates, suggesting limited prior work on MLLM-specific merging benchmarks within the search scope. The OptMerge method (Contribution B) examined only one candidate without refutation, indicating sparse direct overlap in optimization-based task vector techniques. However, the modality merging exploration (Contribution C) encountered one refutable candidate among ten examined, pointing to some existing work on combining diverse modalities. These statistics reflect a targeted semantic search rather than exhaustive coverage, leaving open the possibility of additional relevant literature beyond the top-K matches.

Given the limited search scope and the sparse Homogeneous Model Merging leaf, the work appears to occupy a relatively underexplored intersection of MLLM benchmarking and parameter-level fusion. The single refutation for modality merging suggests partial overlap with prior cross-modal integration efforts, while the benchmark and OptMerge contributions show less direct precedent among the examined candidates. A broader literature review would be needed to confirm whether these impressions hold across the full research landscape.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: merging multimodal large language models for unified capabilities. The field has organized itself around several complementary branches that address different facets of combining specialized models into cohesive systems. Model Merging Techniques and Optimization focuses on algorithmic strategies for blending parameters or outputs from homogeneous or heterogeneous models, often exploring weight-space interpolation and optimization-driven fusion methods such as OptMerge[0] and AdaMMS[3]. Training-Free Integration Strategies emphasizes approaches that avoid costly retraining, leveraging orchestration mechanisms like Training-Free Orchestration[2] and dynamic routing schemes. Multimodal Alignment and Fusion Architectures tackles the challenge of aligning representations across vision, language, and other modalities, with works such as AnyGPT[6] and Cephalo[7] proposing unified tokenization or cross-modal projection layers. Catastrophic Forgetting Mitigation and Cross-Modal Integration address stability and coherence when merging diverse capabilities, while Domain-Specific Applications and Surveys provide contextualized use cases and comprehensive overviews like Model Merging Survey[4]. A particularly active line of work contrasts parameter-level merging—where models are combined by averaging or optimizing weights—with inference-time fusion strategies that coordinate outputs from multiple specialists without modifying their parameters. OptMerge[0] exemplifies the former, proposing optimization-based techniques to merge homogeneous models while preserving task-specific strengths. This approach sits closely alongside AdaMMS[3], which also targets adaptive merging of model parameters, and contrasts with training-free methods like Training-Free Orchestration[2] that dynamically select or blend model predictions at runtime. A recurring theme across these branches is the trade-off between computational efficiency and the depth of integration: parameter merging can yield compact unified models but may struggle with heterogeneous architectures, whereas training-free orchestration offers flexibility at the cost of higher inference overhead. Open questions remain around scalability to many modalities, robustness under distribution shift, and the extent to which merged models can retain fine-grained specialist knowledge without catastrophic interference.

Claimed Contributions

Model merging benchmark for multimodal LLMs

10 retrieved papers

The authors present the first benchmark specifically designed for evaluating model merging methods on multimodal large language models. It provides fine-grained categorization of MLLM capabilities across five tasks and includes both LoRA and full fine-tuning scenarios, along with publicly released weights and code.

10 retrieved papers

OptMerge method for robust task vector optimization

1 retrieved paper

The authors introduce OptMerge, a new model merging algorithm that applies low-rank approximations to eliminate redundant noise from task vectors and optimizes the merged vector through a loss function defined over task vector interactions, achieving improved performance over existing methods.

1 retrieved paper

Exploration of modality merging toward omni-language models

Can Refute

10 retrieved papers

The authors investigate using model merging to integrate different modality encoders (vision, audio, video) into a unified language model without requiring training data, demonstrating that complementarity among multiple modalities outperforms individual modalities and offering a data-free approach to building omni-modal systems.

10 retrieved papers

Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[1] Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging PDF

Wei Yongxian, Cheng RunXi, Yongxian Wei, Jin, Weike, Runxi Cheng, Yang, Enneng, Weike Jin, Shen, Li, Enneng Yang, Hou, Lu, Li Shen, Du Sinan, Lu Hou, Yuan Chun, Sinan Du, Cao Xiaochun, Chun Yuan, Tao, Dacheng, Xiaochun Cao, Dacheng Tao (2025)

[4] Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities PDF

Yang, Enneng, Eun Jin Yang, Shen, Li, Shen Li, Enneng Yang, Guo Guibing, Guibing Guo, Li Shen, Wang XingWei, Xingwei Wang, Cao Xiaochun, Xiaochun Cao, Zhang Jie, Jie Zhang, Tao, Dacheng, Dacheng Tao (2024)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Model merging benchmark for multimodal LLMs

[22] Model Composition for Multimodal Large Language Models PDF

Cannot Refute

[56] Seed-bench: Benchmarking multimodal large language models PDF

Cannot Refute

[57] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models PDF

Cannot Refute

[58] Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models PDF

Cannot Refute

[59] JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks PDF

Cannot Refute

[60] SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension PDF

Cannot Refute

[61] Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models PDF

Cannot Refute

[62] Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models PDF

Cannot Refute

[63] MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark PDF

Cannot Refute

[64] WorldGPT: Empowering LLM as Multimodal World Model PDF

Cannot Refute

Contribution

OptMerge method for robust task vector optimization

[65] Dynamic Bidirectional Feature Enhancement Network for Thin Cloud Removal in Remote Sensing Images PDF

Cannot Refute

Contribution

Exploration of modality merging toward omni-language models

[22] Model Composition for Multimodal Large Language Models PDF

Can Refute

[37] Aligned Better, Listen Better for Audio-Visual Large Language Models PDF

Cannot Refute

[48] Video understanding with large language models: A survey PDF

Cannot Refute

[49] Internvideo2: Scaling foundation models for multimodal video understanding PDF

Cannot Refute

[50] PaLM-E: An Embodied Multimodal Language Model PDF

Cannot Refute

[51] video-salmonn: Speech-enhanced audio-visual large language models PDF

Cannot Refute

[52] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding PDF

Cannot Refute

[53] Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action PDF

Cannot Refute

[54] OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework PDF

Cannot Refute

[55] Chat-univi: Unified visual representation empowers large language models with image and video understanding PDF

Cannot Refute

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

[1] Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging PDF

[4] Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities PDF

Contribution Analysis

Model merging benchmark for multimodal LLMs

[22] Model Composition for Multimodal Large Language Models PDF

[56] Seed-bench: Benchmarking multimodal large language models PDF

[57] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models PDF

[58] Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models PDF

[59] JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks PDF

[60] SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension PDF

[61] Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models PDF

[62] Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models PDF

[63] MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark PDF

[64] WorldGPT: Empowering LLM as Multimodal World Model PDF

OptMerge method for robust task vector optimization

[65] Dynamic Bidirectional Feature Enhancement Network for Thin Cloud Removal in Remote Sensing Images PDF

Exploration of modality merging toward omni-language models

[22] Model Composition for Multimodal Large Language Models PDF

[37] Aligned Better, Listen Better for Audio-Visual Large Language Models PDF

[48] Video understanding with large language models: A survey PDF

[49] Internvideo2: Scaling foundation models for multimodal video understanding PDF

[50] PaLM-E: An Embodied Multimodal Language Model PDF

[51] video-salmonn: Speech-enhanced audio-visual large language models PDF

[52] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding PDF

[53] Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action PDF

[54] OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework PDF

[55] Chat-univi: Unified visual representation empowers large language models with image and video understanding PDF

Table of Contents