Mordal: Automated Pretrained Model Selection for Vision Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Model, Vision Language Model, Model Selection
Abstract:

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest-growing category of multimodal models because of their many practical use cases, including healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities on different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models.

We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using 8.9×–11.6× fewer GPU hours than grid search. We also find that Mordal achieves about 69% higher weighted Kendall's τ on average than the state-of-the-art model selection method across diverse tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Mordal, an automated framework for selecting the best vision language model (VLM) for a user-defined task without manual intervention. According to the taxonomy tree, Mordal sits in the 'Automated Model Search Frameworks' leaf under 'Model Selection and Ranking Methods'. Notably, this leaf contains only the original paper itself—no sibling papers are listed. This suggests that end-to-end automated search systems for VLMs represent a relatively sparse research direction within the broader model selection landscape, which includes nineteen papers across multiple leaves.

The taxonomy reveals that neighboring leaves address related but distinct challenges. 'Text-Based VLM Selection' and 'Model Label Learning Paradigm' focus on language-only selection strategies, while 'Model Ranking Without Labels' explores unsupervised ranking. The 'Adaptation Techniques' branch emphasizes parameter tuning and prompt optimization rather than model discovery. Mordal's positioning indicates it targets the upstream problem of identifying which pretrained VLM to use, whereas sibling branches assume a model is already chosen and focus on refining it. The scope notes clarify that Mordal's search-based approach differs from pure ranking or reuse mechanisms.

Among twenty-seven candidates examined via limited semantic search, none were found to clearly refute any of Mordal's three contributions. The pretrained model selection problem formulation examined ten candidates with zero refutable overlaps; the Mordal framework itself examined seven candidates with the same outcome; and the two-step clustering strategy examined ten candidates, again with no refutations. This suggests that within the top-K semantic matches retrieved, no prior work directly anticipates Mordal's combination of automated search, candidate reduction, and efficient evaluation. However, the analysis is constrained by the search scope and does not claim exhaustive coverage.

Based on the limited literature search of twenty-seven candidates, Mordal appears to occupy a novel position as an end-to-end automated search framework for VLMs. The absence of sibling papers in its taxonomy leaf and the lack of refutable prior work among examined candidates suggest meaningful differentiation. Nonetheless, the analysis reflects top-K semantic retrieval and does not guarantee that no related work exists beyond the examined set.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: automated pretrained model selection for vision language models. The field addresses the challenge of choosing among numerous pretrained VLMs for downstream tasks without exhaustive fine-tuning or evaluation. The taxonomy reveals four main branches: Model Selection and Ranking Methods develop automated search frameworks and ranking strategies to identify suitable models efficiently, often leveraging proxy metrics or lightweight evaluations such as those in Cheap and Quick[3] and AutoV[2]. Adaptation Techniques for Pretrained VLMs focus on parameter-efficient tuning, prompt engineering, and bridging modality gaps to improve model performance on specific domains, as seen in works like Bridge Modality Gaps[4] and Unsupervised Prototype Adapter[5]. Application-Specific VLM Systems deploy VLMs in specialized contexts ranging from robotic state recognition to social media disaster response, while VLM Architecture and Training examines foundational design choices and pretraining strategies that influence transferability.

Recent efforts concentrate on reducing the computational burden of model selection while maintaining predictive accuracy. A handful of works explore ranking models without labels or using minimal data, contrasting expensive full evaluations with fast proxy-based approaches.

Mordal[0] sits within the Automated Model Search Frameworks cluster, emphasizing efficient selection mechanisms that avoid costly retraining cycles. It shares common ground with Pretrained VLM Selection[17] and VLM Selection Reuse[9], which similarly aim to streamline the discovery of well-suited pretrained models. Compared to Cheap and Quick[3], which prioritizes speed through lightweight proxies, Mordal[0] appears to integrate more sophisticated search strategies that balance efficiency with selection quality. This positioning highlights an ongoing tension in the field: whether to rely on rapid heuristics or invest in richer but still tractable evaluation frameworks to guide practitioners toward optimal pretrained VLMs.

Claimed Contributions

Pretrained model selection problem formulation for VLMs

The authors formulate the pretrained model selection problem specifically for vision language models as a resource-constrained task to predict alignment performance. They demonstrate empirically that existing VLMs do not consistently use optimal pretrained vision encoders and language models for different downstream tasks.

10 retrieved papers
Mordal framework for automated VLM model selection

The authors introduce Mordal, an automated framework that efficiently searches for the best combination of pretrained vision encoder and language model for a specific task. The framework uses candidate clustering based on representation similarity, combined with early stopping and scaling prediction to minimize both the number of candidates evaluated and the time required per evaluation.

7 retrieved papers
Two-step clustering strategy with inter- and intra-cluster evaluation

The authors develop a two-step clustering approach that first groups vision encoders by representation similarity using centered kernel alignment (CKA), then clusters language models based on fixed vision representations. This is followed by inter-cluster evaluation to eliminate weak clusters and intra-cluster evaluation to identify the best candidate within remaining clusters.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Pretrained model selection problem formulation for VLMs

The authors formulate the pretrained model selection problem specifically for vision language models as a resource-constrained task to predict alignment performance. They demonstrate empirically that existing VLMs do not consistently use optimal pretrained vision encoders and language models for different downstream tasks.
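As a rough sketch, this formulation can be read as a budget-constrained selection objective. The notation below is ours, not the paper's: pick the vision-encoder/language-model pair with the highest predicted alignment performance while keeping total evaluation cost within a compute budget.

```latex
% Hypothetical notation: \mathcal{V} = candidate vision encoders,
% \mathcal{L} = candidate language models, \hat{P} = predicted alignment
% performance on task data \mathcal{D}, c = GPU-hour cost of evaluating
% a pair, B = the compute budget.
(v^{*}, \ell^{*}) \;=\; \operatorname*{arg\,max}_{(v,\ell)\,\in\,\mathcal{V}\times\mathcal{L}}
  \hat{P}\!\left(v, \ell \mid \mathcal{D}\right)
\quad \text{s.t.} \quad
\sum_{(v,\ell)\ \text{evaluated}} c(v,\ell) \;\le\; B .
```

Under this reading, grid search corresponds to evaluating every pair in the product space, while Mordal's candidate reduction and per-candidate speedups shrink the summation on the left of the budget constraint.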

Contribution

Mordal framework for automated VLM model selection

The authors introduce Mordal, an automated framework that efficiently searches for the best combination of pretrained vision encoder and language model for a specific task. The framework uses candidate clustering based on representation similarity, combined with early stopping and scaling prediction to minimize both the number of candidates evaluated and the time required per evaluation.
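The candidate-pruning idea can be illustrated with a generic early-stopping loop. This is a simplified sketch of the general technique, with hypothetical function names and scoring, not Mordal's actual algorithm: each surviving candidate is scored after partial training, and candidates whose scores stagnate are dropped before the full evaluation budget is spent.

```python
def early_stopping_search(candidates, eval_fn, max_steps, patience=2, min_delta=0.0):
    """Incrementally evaluate candidates, pruning any whose score fails to
    improve by min_delta for `patience` consecutive steps.

    eval_fn(candidate, step) returns a partial-training score (hypothetical
    signature). Returns the best surviving candidate and the total number of
    evaluation steps spent, which is the cost pruning saves.
    """
    alive = {c: {"best": float("-inf"), "bad": 0} for c in candidates}
    spent = 0
    for step in range(1, max_steps + 1):
        for cand in list(alive):          # copy keys so we can prune mid-loop
            score = eval_fn(cand, step)
            spent += 1
            state = alive[cand]
            if score > state["best"] + min_delta:
                state["best"], state["bad"] = score, 0
            else:
                state["bad"] += 1
                if state["bad"] >= patience and len(alive) > 1:
                    del alive[cand]       # drop a stagnating candidate early
    best = max(alive, key=lambda c: alive[c]["best"])
    return best, spent
```

In this toy setting, a candidate whose score plateaus is pruned after `patience` flat steps, so the total evaluation count stays below the grid-search cost of `len(candidates) * max_steps`.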

Contribution

Two-step clustering strategy with inter- and intra-cluster evaluation

The authors develop a two-step clustering approach that first groups vision encoders by representation similarity using centered kernel alignment (CKA), then clusters language models based on fixed vision representations. This is followed by inter-cluster evaluation to eliminate weak clusters and intra-cluster evaluation to identify the best candidate within remaining clusters.
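The first step of this strategy can be sketched with linear CKA and a greedy grouping rule. The similarity index below is the standard linear CKA; the threshold and the greedy grouping heuristic are our own illustrative choices, and the paper's actual clustering procedure may differ.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between representation matrices
    X (n x d1) and Y (n x d2), where rows correspond to the same n inputs."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def cluster_by_cka(reps, threshold=0.8):
    """Greedy grouping: an encoder joins the first cluster whose founding
    member it matches with CKA >= threshold; otherwise it starts a new one.

    reps maps encoder names to (n x d) feature matrices on a probe set.
    """
    clusters = []
    for name, feats in reps.items():
        for cluster in clusters:
            if linear_cka(feats, reps[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```

Linear CKA is invariant to isotropic scaling and orthogonal rotation of the features, so two encoders producing the same representation up to such transforms land in the same cluster, letting a single representative stand in for the group during inter-cluster evaluation.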