OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging
Overview
Overall Novelty Assessment
The paper introduces a benchmark for merging multimodal large language models across tasks like VQA, geometry, chart understanding, OCR, and grounding, alongside a novel OptMerge method for noise-reduced task vector optimization. It resides in the Homogeneous Model Merging leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader Model Merging Techniques and Optimization branch. This positioning suggests the work addresses a less crowded niche focused on parameter-level fusion for models with identical architectures, rather than heterogeneous or temporal merging scenarios.
The taxonomy reveals neighboring leaves addressing complementary challenges: Heterogeneous Architecture Merging tackles parameter asymmetry across distinct structures, Temporal Model Merging handles progressive integration over time, and Neuron-Level Parameter Fusion operates at finer granularity. The paper's exploration of cross-modality merging (vision-language, audio-language, video-language toward omni-language models) connects to the Cross-Modal Integration branch, particularly Audio-Visual Integration and Vision-Language Grounding, though these branches focus on alignment rather than parameter merging. The Training-Free Integration Strategies branch offers an alternative paradigm through orchestration and composition, contrasting with the paper's parameter-fusion approach.
Among the twenty-one candidates examined in total, the benchmark contribution (Contribution A) showed no clear refutation across its ten candidates, suggesting limited prior work on MLLM-specific merging benchmarks within the search scope. The OptMerge method (Contribution B) was compared against only one candidate, with no refutation found, indicating sparse direct overlap in optimization-based task vector techniques. The modality merging exploration (Contribution C), however, encountered one refutable candidate among its ten, pointing to some existing work on combining diverse modalities. These statistics reflect a targeted semantic search rather than exhaustive coverage, leaving open the possibility of additional relevant literature beyond the top-K matches.
Given the limited search scope and the sparse Homogeneous Model Merging leaf, the work appears to occupy a relatively underexplored intersection of MLLM benchmarking and parameter-level fusion. The single refutation for modality merging suggests partial overlap with prior cross-modal integration efforts, while the benchmark and OptMerge contributions show less direct precedent among the examined candidates. A broader literature review would be needed to confirm whether these impressions hold across the full research landscape.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present the first benchmark specifically designed for evaluating model merging methods on multimodal large language models. It provides fine-grained categorization of MLLM capabilities across five tasks and includes both LoRA and full fine-tuning scenarios, along with publicly released weights and code.
The authors introduce OptMerge, a new model merging algorithm that applies low-rank approximations to eliminate redundant noise from task vectors and optimizes the merged vector through a loss function defined over task vector interactions, achieving improved performance over existing methods.
The authors investigate using model merging to integrate different modality encoders (vision, audio, video) into a unified language model without requiring training data, demonstrating that merging complementary modalities outperforms any single modality and offering a data-free approach to building omni-modal systems.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging PDF
[4] Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Model merging benchmark for multimodal LLMs
The authors present the first benchmark specifically designed for evaluating model merging methods on multimodal large language models. It provides fine-grained categorization of MLLM capabilities across five tasks and includes both LoRA and full fine-tuning scenarios, along with publicly released weights and code.
[22] Model Composition for Multimodal Large Language Models PDF
[56] Seed-bench: Benchmarking multimodal large language models PDF
[57] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models PDF
[58] Affectgpt: A new dataset, model, and benchmark for emotion understanding with multimodal large language models PDF
[59] JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks PDF
[60] SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension PDF
[61] Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models PDF
[62] Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models PDF
[63] MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark PDF
[64] WorldGPT: Empowering LLM as Multimodal World Model PDF
OptMerge method for robust task vector optimization
The authors introduce OptMerge, a new model merging algorithm that applies low-rank approximations to eliminate redundant noise from task vectors and optimizes the merged vector through a loss function defined over task vector interactions, achieving improved performance over existing methods.
[65] Dynamic Bidirectional Feature Enhancement Network for Thin Cloud Removal in Remote Sensing Images PDF
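The paper's exact interaction loss is not reproduced here, but the core idea it describes — suppressing redundant noise in task vectors via low-rank approximation before merging them into the base weights — can be sketched as follows. This is a minimal illustration assuming per-matrix truncated SVD as the denoising step and a simple weighted sum as the merge; the function names, ranks, and weights are illustrative, not OptMerge's actual implementation.

```python
import numpy as np

def low_rank_denoise(task_vector: np.ndarray, rank: int) -> np.ndarray:
    """Keep only the top-`rank` singular components of a task vector
    (fine-tuned weights minus base weights), discarding the low-energy
    tail treated here as redundant noise."""
    U, S, Vt = np.linalg.svd(task_vector, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

def merge_models(base: np.ndarray,
                 task_vectors: list[np.ndarray],
                 weights: list[float],
                 rank: int = 2) -> np.ndarray:
    """Weighted sum of denoised task vectors applied to the base weights."""
    delta = sum(w * low_rank_denoise(tv, rank)
                for w, tv in zip(weights, task_vectors))
    return base + delta

# Toy example: one 8x8 weight matrix, three task-specific deltas.
rng = np.random.default_rng(0)
base = rng.standard_normal((8, 8))
task_vectors = [rng.standard_normal((8, 8)) for _ in range(3)]
merged = merge_models(base, task_vectors, weights=[0.4, 0.3, 0.3], rank=2)
```

In the method described by the paper, the merging coefficients are not fixed by hand as above but optimized against a loss defined over task-vector interactions; this sketch only covers the noise-reduction and fusion steps applied per weight matrix.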
Exploration of modality merging toward omni-language models
The authors investigate using model merging to integrate different modality encoders (vision, audio, video) into a unified language model without requiring training data, demonstrating that merging complementary modalities outperforms any single modality and offering a data-free approach to building omni-modal systems.