Mordal: Automated Pretrained Model Selection for Vision Language Models
Overview
Overall Novelty Assessment
The paper introduces Mordal, an automated framework for selecting the best vision language model (VLM) for a user-defined task without manual intervention. According to the taxonomy tree, Mordal sits in the 'Automated Model Search Frameworks' leaf under 'Model Selection and Ranking Methods'. Notably, this leaf contains only the original paper itself; no sibling papers are listed. This suggests that end-to-end automated search systems for VLMs represent a relatively sparse research direction within the broader model selection landscape, which includes nineteen papers across multiple leaves.
The taxonomy reveals that neighboring leaves address related but distinct challenges. 'Text-Based VLM Selection' and 'Model Label Learning Paradigm' focus on language-only selection strategies, while 'Model Ranking Without Labels' explores unsupervised ranking. The 'Adaptation Techniques' branch emphasizes parameter tuning and prompt optimization rather than model discovery. Mordal's positioning indicates it targets the upstream problem of identifying which pretrained VLM to use, whereas sibling branches assume a model is already chosen and focus on refining it. The scope notes clarify that Mordal's search-based approach differs from pure ranking or reuse mechanisms.
Among twenty-seven candidates examined via a limited semantic search, none was found to clearly refute any of Mordal's three contributions. For the pretrained model selection problem formulation, ten candidates were examined with no refuting overlap; for the Mordal framework itself, seven candidates, with the same outcome; and for the two-step clustering strategy, ten candidates, again with no refutations. This suggests that, within the top-K semantic matches retrieved, no prior work directly anticipates Mordal's combination of automated search, candidate reduction, and efficient evaluation. However, the analysis is constrained by the search scope and does not claim exhaustive coverage.
Based on the limited literature search of twenty-seven candidates, Mordal appears to occupy a novel position as an end-to-end automated search framework for VLMs. The absence of sibling papers in its taxonomy leaf and the lack of refutable prior work among examined candidates suggest meaningful differentiation. Nonetheless, the analysis reflects top-K semantic retrieval and does not guarantee that no related work exists beyond the examined set.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors formulate the pretrained model selection problem specifically for vision language models as a resource-constrained task to predict alignment performance. They demonstrate empirically that existing VLMs do not consistently use optimal pretrained vision encoders and language models for different downstream tasks.
The authors introduce Mordal, an automated framework that efficiently searches for the best combination of pretrained vision encoder and language model for a specific task. The framework uses candidate clustering based on representation similarity, combined with early stopping and scaling prediction to minimize both the number of candidates evaluated and the time required per evaluation.
The authors develop a two-step clustering approach that first groups vision encoders by representation similarity using centered kernel alignment (CKA), then clusters language models based on fixed vision representations. This is followed by inter-cluster evaluation to eliminate weak clusters and intra-cluster evaluation to identify the best candidate within remaining clusters.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Pretrained model selection problem formulation for VLMs
The authors formulate the pretrained model selection problem specifically for vision language models as a resource-constrained task to predict alignment performance. They demonstrate empirically that existing VLMs do not consistently use optimal pretrained vision encoders and language models for different downstream tasks.
[4] Bridge the modality and capability gaps in vision-language model selection
[20] Cogvlm: Visual expert for pretrained language models
[21] Egovideo: Exploring egocentric foundation model and downstream adaptation
[22] Vision-Language Foundation Models as Effective Robot Imitators
[23] Visual-language foundation models in medicine
[24] Class-Specific Prompt Learning for Vision-Language Models
[25] Transferring knowledge from large foundation models to small downstream models
[26] Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations
[27] EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis
[28] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
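The resource-constrained formulation described in this contribution can be sketched as a budgeted argmax over (vision encoder, language model) pairs. The names below (`select_vlm`, `score`, `cost`, `budget`) are illustrative assumptions for this sketch, not Mordal's actual API.

```python
def select_vlm(candidates, score, cost, budget):
    """Pick the (vision_encoder, language_model) pair with the best
    predicted alignment score whose evaluation cost fits the budget.
    `score` and `cost` are caller-supplied estimators (assumed here)."""
    feasible = [(v, l) for (v, l) in candidates if cost(v, l) <= budget]
    if not feasible:
        raise ValueError("no candidate fits the budget")
    return max(feasible, key=lambda pair: score(*pair))
```

In this framing, the quality of the selected pair hinges entirely on how cheaply and accurately `score` predicts alignment performance, which is the bottleneck Mordal's clustering and early-stopping machinery targets.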
Mordal framework for automated VLM model selection
The authors introduce Mordal, an automated framework that efficiently searches for the best combination of pretrained vision encoder and language model for a specific task. The framework uses candidate clustering based on representation similarity, combined with early stopping and scaling prediction to minimize both the number of candidates evaluated and the time required per evaluation.
[29] Automated machine learning in action
[30] Discovering new intents via constrained deep adaptive clustering with cluster refinement
[31] Semi-Supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection
[32] Document clustering with cluster refinement and model selection capabilities
[33] Analyzing grid log data with affinity propagation
[34] Auto-Tuning with Early Stopping in AutoPas
[35] Efficiently pre-training language models with mixtures of cluster-oriented, trainability-aware experts
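The candidate-reduction idea behind this contribution (evaluate cluster representatives on short, early-stopped runs and discard weak clusters before spending full evaluation budget) can be sketched as follows. All names (`prune_clusters`, `partial_score`, `keep_fraction`) are hypothetical, chosen for this sketch rather than taken from the paper.

```python
def prune_clusters(clusters, partial_score, keep_fraction=0.5):
    """Inter-cluster evaluation sketch: score one representative per
    cluster with a cheap, early-stopped run (`partial_score`), then
    keep only the top-scoring fraction of clusters for the more
    expensive intra-cluster search."""
    ranked = sorted(clusters, key=lambda c: partial_score(c[0]), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]
```

The design choice here mirrors successive-halving-style search: cheap partial evaluations are noisy per candidate but usually sufficient to separate whole clusters, so the full budget is spent only inside promising clusters.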
Two-step clustering strategy with inter- and intra-cluster evaluation
The authors develop a two-step clustering approach that first groups vision encoders by representation similarity using centered kernel alignment (CKA), then clusters language models based on fixed vision representations. This is followed by inter-cluster evaluation to eliminate weak clusters and intra-cluster evaluation to identify the best candidate within remaining clusters.
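The first clustering step relies on centered kernel alignment to measure representation similarity between encoders. A minimal sketch of the standard linear-CKA formula is shown below; this is a textbook implementation for illustration, not Mordal's exact code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation
    matrices X (n x d1) and Y (n x d2), where rows correspond to the
    same n inputs. Returns a similarity in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)  # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

Pairwise `linear_cka` scores over a probe set would give the similarity matrix on which vision encoders are grouped, with language models then clustered in a second pass against fixed vision representations.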