Abstract:

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do the choice and specific capabilities of the underlying VLM affect the performance of VLA policies? We introduce \textbf{VLM4VLA}, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters, enabling fair and efficient comparison. Though simple, our pipeline proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on downstream tasks across three benchmarks, we find that, contrary to common assumptions, a VLM's general capabilities are poor predictors of its downstream task performance. Inconsistencies across benchmarks suggest that VLA policies require capabilities beyond those current VLMs pursue. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Lastly, our analysis reveals that the vision encoder is a critical bottleneck and that the ability to fine-tune it is crucial for strong performance. These results highlight a significant gap between current VLM pretraining paradigms and the specific demands of embodied tasks. We will release our code, models, and evaluation logs at \href{https://sites.google.com/view/vlm4vla}{our anonymous website} to encourage further research in this direction.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VLM4VLA, a minimal adaptation pipeline that converts general-purpose vision-language models into vision-language-action policies using a small set of learnable parameters. It resides in the VLM-to-VLA Adaptation Frameworks leaf, which contains only three papers in total, including this work. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 19 leaf nodes, suggesting that the specific question of systematic VLM-to-VLA conversion remains underexplored compared with more crowded areas such as full-scale VLA architectures or general manipulation task execution.

The taxonomy tree reveals that VLM-to-VLA adaptation sits within the larger VLA Model Architectures and Design branch, which also includes compact/efficient models and full-scale architectures. Neighboring leaves address efficiency-focused designs like TinyVLA and advanced reasoning systems with memory or multimodal integration. The scope note for this leaf explicitly excludes end-to-end VLA designs, positioning the work as a bridge between pretrained VLMs and robotic control rather than a novel architecture from scratch. Related directions in data generation and evaluation frameworks provide complementary infrastructure, but the core adaptation methodology remains distinct.

Among the 30 candidates examined (10 per contribution), the minimal adaptation pipeline contribution shows overlap with 3 of its 10 reviewed candidates, while the systematic empirical study of VLM capabilities found no clear refutations among its 10. The analysis of the vision encoder as a bottleneck encountered 3 potentially overlapping works among its 10. The empirical-study component therefore appears more novel within this limited search scope, whereas the adaptation pipeline and encoder analysis face more substantial prior work. These statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage of the field.

Based on the limited search scope of 30 candidates, the work's novelty appears mixed: the systematic empirical investigation of how VLM capabilities transfer to embodied control seems less explored, while the minimal adaptation approach and encoder bottleneck analysis encounter more prior work. The sparse population of the VLM-to-VLA adaptation leaf suggests room for contributions, though the analysis cannot rule out relevant work outside the top-30 semantic matches examined.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 6

Research Landscape Overview

Core task: vision-language model capabilities for robotic manipulation tasks. The field has organized itself around several major branches that reflect different stages and aspects of building vision-language-action (VLA) systems. VLA Model Architectures and Design focuses on how to construct or adapt pretrained vision-language models into action-generating policies, including frameworks that bridge VLMs to VLAs and efficient architectures like TinyVLA[1] or RoboMamba[11]. High-Level Planning and Reasoning addresses how these models can perform task decomposition and multi-step reasoning, while Multimodal and Sensory Integration explores incorporating additional modalities such as tactile or audio signals. Data Generation and Pretraining examines scalable data collection and pretraining strategies, and Testing and Evaluation Frameworks provides benchmarks and metrics for assessing VLA performance. Surveys and Reviews, including VLA Systematic Review[3] and VLA Recipe Survey[8], synthesize emerging best practices, while Specialized Applications and Domains target specific use cases such as industrial robotics or garment manipulation.

Within this landscape, a particularly active line of work centers on VLM-to-VLA adaptation frameworks, which seek efficient pathways to convert large pretrained vision-language models into robotic controllers without prohibitive retraining costs. VLM4VLA[0] sits squarely in this branch, proposing methods to leverage existing VLM representations for action prediction. Nearby efforts like RoboUniView[9] and VLM Robot Imitators[12] similarly explore how to repurpose VLM encoders or align language-conditioned features with low-level control, though they may differ in whether they emphasize unified architectures or imitation-based fine-tuning. A contrasting theme appears in works that prioritize efficiency and deployment constraints, such as TinyVLA[1] and TinyVLA Fast Efficient[10], which focus on model compression and real-time inference.

The central tension across these directions involves balancing the rich semantic understanding of large VLMs against the need for sample-efficient adaptation, computational feasibility, and robust generalization to novel manipulation scenarios.

Claimed Contributions

VLM4VLA minimal adaptation pipeline

The authors propose a lightweight framework that adapts Vision-Language Models into Vision-Language-Action policies by adding new parameters amounting to less than 1% of the model. This design enables fair comparison across different VLMs while remaining competitive with more sophisticated architectures.

Retrieved papers: 10 (Can Refute)
Systematic empirical study of VLM capabilities for embodied control

The authors conduct large-scale experiments evaluating 17 VLMs across three benchmarks (CALVIN, SimplerEnv, LIBERO) to investigate how general VLM capabilities, embodied-specific fine-tuning, and vision encoder training strategies affect downstream manipulation performance. They reveal inconsistencies and gaps between VLM pretraining paradigms and the demands of embodied tasks.

Retrieved papers: 10
Analysis of vision encoder as critical bottleneck

Through ablation studies, the authors identify that fine-tuning the vision encoder is essential for VLA performance, showing significant degradation when the encoder is frozen. This finding highlights the importance of visual adaptation over simply scaling language-model parameters for embodied tasks.

Retrieved papers: 10 (Can Refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

VLM4VLA minimal adaptation pipeline

The authors propose a lightweight framework that adapts Vision-Language Models into Vision-Language-Action policies by adding new parameters amounting to less than 1% of the model. This design enables fair comparison across different VLMs while remaining competitive with more sophisticated architectures.
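To make the "less than 1% new parameters" budget concrete, the sketch below computes the fraction contributed by a small action head and a few learnable query embeddings on top of a large backbone. All sizes here are hypothetical round numbers for illustration, not figures taken from the VLM4VLA paper.

```python
# Illustrative parameter-budget check for a minimal VLM-to-VLA adaptation.
# Every number below is hypothetical, not taken from the paper.

def new_param_fraction(backbone_params: int, new_params: int) -> float:
    """Fraction of the full policy made up of newly added parameters."""
    return new_params / (backbone_params + new_params)

# Hypothetical 7B-parameter VLM backbone.
backbone = 7_000_000_000

# Hypothetical additions: a linear action head mapping a 4096-dim hidden
# state to a chunk of 8 timesteps of 7-DoF actions, plus 8 learnable
# action query embeddings.
action_head = 4096 * (7 * 8)       # linear head weight matrix
action_queries = 8 * 4096          # learnable query tokens
new = action_head + action_queries

frac = new_param_fraction(backbone, new)
print(f"new parameters: {new:,} ({frac:.4%} of the policy)")
```

Under these assumed sizes the additions are orders of magnitude below the 1% budget, which is what allows the backbone choice, rather than the adaptation machinery, to dominate any performance differences.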

Contribution

Systematic empirical study of VLM capabilities for embodied control

The authors conduct large-scale experiments evaluating 17 VLMs across three benchmarks (CALVIN, SimplerEnv, LIBERO) to investigate how general VLM capabilities, embodied-specific fine-tuning, and vision encoder training strategies affect downstream manipulation performance. They reveal inconsistencies and gaps between VLM pretraining paradigms and the demands of embodied tasks.
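The claim that general capabilities are "poor predictors" is the kind of statement one would test by rank-correlating a general VLM benchmark score with downstream manipulation success. The sketch below implements Spearman's rank correlation from scratch on synthetic scores for six hypothetical VLMs; the numbers are invented for illustration and carry no relation to the paper's results.

```python
# Rank-correlate a general VLM benchmark score with a downstream
# manipulation success rate. All data below is synthetic.

def rankdata(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(xs), rankdata(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Synthetic scores for six hypothetical VLMs.
general_benchmark = [62.1, 70.3, 55.4, 68.0, 74.2, 60.5]  # e.g. a VQA score
manip_success     = [0.48, 0.41, 0.55, 0.52, 0.44, 0.50]  # task success rate

rho = spearman(general_benchmark, manip_success)
print(f"Spearman rho = {rho:+.2f}")
```

A rho near zero (or, as in this synthetic example, negative) across benchmarks would be consistent with the paper's finding that general-capability rankings do not transfer to control performance.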

Contribution

Analysis of vision encoder as critical bottleneck

Through ablation studies, the authors identify that fine-tuning the vision encoder is essential for VLA performance, showing significant degradation when the encoder is frozen. This finding highlights the importance of visual adaptation over simply scaling language-model parameters for embodied tasks.
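The two training configurations being compared in such an ablation differ only in which modules the optimizer updates. The sketch below does the bookkeeping for "full fine-tune" versus "frozen vision encoder" using hypothetical module sizes (the names and counts are illustrative, not the paper's).

```python
# Trainable-parameter accounting for the encoder-freezing ablation.
# Module names and parameter counts are hypothetical.

MODULES = {
    "vision_encoder": 400_000_000,    # e.g. a ViT-L-scale encoder
    "llm_backbone":   7_000_000_000,  # large language-model backbone
    "action_head":          300_000,  # newly added adaptation parameters
}

def trainable_params(frozen: set) -> int:
    """Total parameters updated by the optimizer, given frozen module names."""
    return sum(n for name, n in MODULES.items() if name not in frozen)

full_ft   = trainable_params(frozen=set())
frozen_ve = trainable_params(frozen={"vision_encoder"})
print(f"full fine-tune:        {full_ft:,}")
print(f"frozen vision encoder: {frozen_ve:,}")
```

Note that the frozen-encoder run still updates the vast majority of the parameters, which is why a large performance drop in that setting points specifically at visual adaptation, not at overall capacity.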