Mordal: Automated Pretrained Model Selection for Vision Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Model, Vision Language Model, Model Selection
Abstract:

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest-growing category of multimodal models because of their many practical use cases, including healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities on different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models.

We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using 8.9×–11.6× fewer GPU hours than grid search. We also find that Mordal achieves about 69% higher weighted Kendall's τ on average than the state-of-the-art model selection method across diverse tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Mordal, an automated framework for selecting the best vision language model (VLM) for a user-defined task without manual intervention. According to the taxonomy tree, Mordal sits in the 'Automated Model Search Frameworks' leaf under 'Model Selection and Ranking Methods'. Notably, this leaf contains only the original paper itself—no sibling papers are listed. This suggests that end-to-end automated search systems for VLMs represent a relatively sparse research direction within the broader model selection landscape, which includes nineteen papers across multiple leaves.

The taxonomy reveals that neighboring leaves address related but distinct challenges. 'Text-Based VLM Selection' and 'Model Label Learning Paradigm' focus on language-only selection strategies, while 'Model Ranking Without Labels' explores unsupervised ranking. The 'Adaptation Techniques' branch emphasizes parameter tuning and prompt optimization rather than model discovery. Mordal's positioning indicates it targets the upstream problem of identifying which pretrained VLM to use, whereas sibling branches assume a model is already chosen and focus on refining it. The scope notes clarify that Mordal's search-based approach differs from pure ranking or reuse mechanisms.

Among twenty-seven candidates examined via limited semantic search, none were found to clearly refute any of Mordal's three contributions. The pretrained model selection problem formulation examined ten candidates with zero refutable overlaps; the Mordal framework itself examined seven candidates with the same outcome; and the two-step clustering strategy examined ten candidates, again with no refutations. This suggests that within the top-K semantic matches retrieved, no prior work directly anticipates Mordal's combination of automated search, candidate reduction, and efficient evaluation. However, the analysis is constrained by the search scope and does not claim exhaustive coverage.

Based on the limited literature search of twenty-seven candidates, Mordal appears to occupy a novel position as an end-to-end automated search framework for VLMs. The absence of sibling papers in its taxonomy leaf and the lack of refutable prior work among examined candidates suggest meaningful differentiation. Nonetheless, the analysis reflects top-K semantic retrieval and does not guarantee that no related work exists beyond the examined set.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: automated pretrained model selection for vision language models. The field addresses the challenge of choosing among numerous pretrained VLMs for downstream tasks without exhaustive fine-tuning or evaluation. The taxonomy reveals four main branches: Model Selection and Ranking Methods develop automated search frameworks and ranking strategies to identify suitable models efficiently, often leveraging proxy metrics or lightweight evaluations such as those in Cheap and Quick[3] and AutoV[2]. Adaptation Techniques for Pretrained VLMs focus on parameter-efficient tuning, prompt engineering, and bridging modality gaps to improve model performance on specific domains, as seen in works like Bridge Modality Gaps[4] and Unsupervised Prototype Adapter[5]. Application-Specific VLM Systems deploy VLMs in specialized contexts ranging from robotic state recognition to social media disaster response, while VLM Architecture and Training examines foundational design choices and pretraining strategies that influence transferability.

Recent efforts concentrate on reducing the computational burden of model selection while maintaining predictive accuracy. A handful of works explore ranking models without labels or using minimal data, contrasting expensive full evaluations with fast proxy-based approaches.

Mordal[0] sits within the Automated Model Search Frameworks cluster, emphasizing efficient selection mechanisms that avoid costly retraining cycles. It shares common ground with Pretrained VLM Selection[17] and VLM Selection Reuse[9], which similarly aim to streamline the discovery of well-suited pretrained models. Compared to Cheap and Quick[3], which prioritizes speed through lightweight proxies, Mordal[0] appears to integrate more sophisticated search strategies that balance efficiency with selection quality. This positioning highlights an ongoing tension in the field: whether to rely on rapid heuristics or invest in richer but still tractable evaluation frameworks to guide practitioners toward optimal pretrained VLMs.

Claimed Contributions

Pretrained model selection problem formulation for VLMs

The authors formulate the pretrained model selection problem specifically for vision language models as a resource-constrained task to predict alignment performance. They demonstrate empirically that existing VLMs do not consistently use optimal pretrained vision encoders and language models for different downstream tasks.

10 retrieved papers
Mordal framework for automated VLM model selection

The authors introduce Mordal, an automated framework that efficiently searches for the best combination of pretrained vision encoder and language model for a specific task. The framework uses candidate clustering based on representation similarity, combined with early stopping and scaling prediction to minimize both the number of candidates evaluated and the time required per evaluation.

7 retrieved papers
Two-step clustering strategy with inter- and intra-cluster evaluation

The authors develop a two-step clustering approach that first groups vision encoders by representation similarity using centered kernel alignment (CKA), then clusters language models based on fixed vision representations. This is followed by inter-cluster evaluation to eliminate weak clusters and intra-cluster evaluation to identify the best candidate within remaining clusters.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Pretrained model selection problem formulation for VLMs

The authors formulate the pretrained model selection problem specifically for vision language models as a resource-constrained task to predict alignment performance. They demonstrate empirically that existing VLMs do not consistently use optimal pretrained vision encoders and language models for different downstream tasks.
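As a rough sketch, this formulation can be read as a budget-constrained selection objective. The notation below is ours, not the paper's: pick the vision-encoder/language-model pair with the highest predicted alignment performance while keeping total evaluation cost within a compute budget.

```latex
% Hypothetical notation: \mathcal{V} = candidate vision encoders,
% \mathcal{L} = candidate language models, \hat{P} = predicted alignment
% performance on task data \mathcal{D}, c = GPU-hour cost of evaluating
% a pair, B = the compute budget.
(v^{*}, \ell^{*}) \;=\; \operatorname*{arg\,max}_{(v,\ell)\,\in\,\mathcal{V}\times\mathcal{L}}
  \hat{P}\!\left(v, \ell \mid \mathcal{D}\right)
\quad \text{s.t.} \quad
\sum_{(v,\ell)\ \text{evaluated}} c(v,\ell) \;\le\; B .
```

Under this reading, grid search corresponds to evaluating every pair in the product space, while Mordal's candidate reduction and per-candidate speedups shrink the summation on the left of the budget constraint.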

Contribution

Mordal framework for automated VLM model selection

The authors introduce Mordal, an automated framework that efficiently searches for the best combination of pretrained vision encoder and language model for a specific task. The framework uses candidate clustering based on representation similarity, combined with early stopping and scaling prediction to minimize both the number of candidates evaluated and the time required per evaluation.
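The candidate-pruning idea can be illustrated with a generic early-stopping loop. This is a simplified sketch of the general technique, with hypothetical function names and scoring, not Mordal's actual algorithm: each surviving candidate is scored after partial training, and candidates whose scores stagnate are dropped before the full evaluation budget is spent.

```python
def early_stopping_search(candidates, eval_fn, max_steps, patience=2, min_delta=0.0):
    """Incrementally evaluate candidates, pruning any whose score fails to
    improve by min_delta for `patience` consecutive steps.

    eval_fn(candidate, step) returns a partial-training score (hypothetical
    signature). Returns the best surviving candidate and the total number of
    evaluation steps spent, which is the cost pruning saves.
    """
    alive = {c: {"best": float("-inf"), "bad": 0} for c in candidates}
    spent = 0
    for step in range(1, max_steps + 1):
        for cand in list(alive):          # copy keys so we can prune mid-loop
            score = eval_fn(cand, step)
            spent += 1
            state = alive[cand]
            if score > state["best"] + min_delta:
                state["best"], state["bad"] = score, 0
            else:
                state["bad"] += 1
                if state["bad"] >= patience and len(alive) > 1:
                    del alive[cand]       # drop a stagnating candidate early
    best = max(alive, key=lambda c: alive[c]["best"])
    return best, spent
```

In this toy setting, a candidate whose score plateaus is pruned after `patience` flat steps, so the total evaluation count stays below the grid-search cost of `len(candidates) * max_steps`.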

Contribution

Two-step clustering strategy with inter- and intra-cluster evaluation

The authors develop a two-step clustering approach that first groups vision encoders by representation similarity using centered kernel alignment (CKA), then clusters language models based on fixed vision representations. This is followed by inter-cluster evaluation to eliminate weak clusters and intra-cluster evaluation to identify the best candidate within remaining clusters.
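The first step of this strategy can be sketched with linear CKA and a greedy grouping rule. The similarity index below is the standard linear CKA; the threshold and the greedy grouping heuristic are our own illustrative choices, and the paper's actual clustering procedure may differ.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between representation matrices
    X (n x d1) and Y (n x d2), where rows correspond to the same n inputs."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def cluster_by_cka(reps, threshold=0.8):
    """Greedy grouping: an encoder joins the first cluster whose founding
    member it matches with CKA >= threshold; otherwise it starts a new one.

    reps maps encoder names to (n x d) feature matrices on a probe set.
    """
    clusters = []
    for name, feats in reps.items():
        for cluster in clusters:
            if linear_cka(feats, reps[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```

Linear CKA is invariant to isotropic scaling and orthogonal rotation of the features, so two encoders producing the same representation up to such transforms land in the same cluster, letting a single representative stand in for the group during inter-cluster evaluation.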