Abstract:

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do the choice and specific capabilities of the underlying VLM affect the performance of VLA policies? We introduce \textbf{VLM4VLA}, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters, enabling fair and efficient comparison. Though simple, our pipeline proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on downstream tasks across three benchmarks, we find that, contrary to common assumptions, a VLM's general capabilities are poor predictors of its downstream task performance. Inconsistencies across benchmarks suggest that VLA policies require capabilities beyond those current VLMs pursue. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Lastly, our analysis reveals that the vision encoder is a critical bottleneck and that the ability to fine-tune it is crucial for strong performance. These results highlight a significant gap between current VLM pretraining paradigms and the specific demands of embodied tasks. We will release our code, models, and evaluation logs at \href{https://sites.google.com/view/vlm4vla}{our anonymous website} to encourage further research in this direction.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VLM4VLA, a minimal adaptation pipeline that converts general-purpose vision-language models into vision-language-action policies using a small set of learnable parameters. It resides in the VLM-to-VLA Adaptation Frameworks leaf, which contains only three papers in total, including this work. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 19 leaf nodes, suggesting that the specific question of systematic VLM-to-VLA conversion remains underexplored compared with more crowded areas such as full-scale VLA architectures or general manipulation task execution.

The taxonomy tree reveals that VLM-to-VLA adaptation sits within the larger VLA Model Architectures and Design branch, which also includes compact/efficient models and full-scale architectures. Neighboring leaves address efficiency-focused designs like TinyVLA and advanced reasoning systems with memory or multimodal integration. The scope note for this leaf explicitly excludes end-to-end VLA designs, positioning the work as a bridge between pretrained VLMs and robotic control rather than a novel architecture from scratch. Related directions in data generation and evaluation frameworks provide complementary infrastructure, but the core adaptation methodology remains distinct.

Among the 30 candidates examined (10 per contribution), the minimal adaptation pipeline contribution shows overlap with 3 of its 10 reviewed candidates, while the systematic empirical study of VLM capabilities found no clear refutations among its 10. The analysis of the vision encoder as a bottleneck encountered 3 potentially overlapping works among its 10. The empirical-study component therefore appears more novel within this limited search scope, whereas the adaptation pipeline and encoder analysis face more substantial prior work. These statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage of the field.

Based on the limited search scope of 30 candidates, the work's novelty appears mixed: the systematic empirical investigation of how VLM capabilities transfer to embodied control seems less explored, while the minimal adaptation approach and encoder bottleneck analysis encounter more prior work. The sparse population of the VLM-to-VLA adaptation leaf suggests room for contributions, though the analysis cannot rule out relevant work outside the top-30 semantic matches examined.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 6

Research Landscape Overview

Core task: vision-language model capabilities for robotic manipulation tasks. The field has organized itself around several major branches that reflect different stages and aspects of building vision-language-action (VLA) systems. VLA Model Architectures and Design focuses on how to construct or adapt pretrained vision-language models into action-generating policies, including frameworks that bridge VLMs to VLAs and efficient architectures like TinyVLA[1] or RoboMamba[11]. High-Level Planning and Reasoning addresses how these models can perform task decomposition and multi-step reasoning, while Multimodal and Sensory Integration explores incorporating additional modalities such as tactile or audio signals. Data Generation and Pretraining examines scalable data collection and pretraining strategies, and Testing and Evaluation Frameworks provides benchmarks and metrics for assessing VLA performance. Surveys and Reviews, including VLA Systematic Review[3] and VLA Recipe Survey[8], synthesize emerging best practices, while Specialized Applications and Domains target specific use cases such as industrial robotics or garment manipulation.

Within this landscape, a particularly active line of work centers on VLM-to-VLA adaptation frameworks, which seek efficient pathways to convert large pretrained vision-language models into robotic controllers without prohibitive retraining costs. VLM4VLA[0] sits squarely in this branch, proposing methods to leverage existing VLM representations for action prediction. Nearby efforts like RoboUniView[9] and VLM Robot Imitators[12] similarly explore how to repurpose VLM encoders or align language-conditioned features with low-level control, though they may differ in whether they emphasize unified architectures or imitation-based fine-tuning. A contrasting theme appears in works that prioritize efficiency and deployment constraints, such as TinyVLA[1] and TinyVLA Fast Efficient[10], which focus on model compression and real-time inference.

The central tension across these directions involves balancing the rich semantic understanding of large VLMs against the need for sample-efficient adaptation, computational feasibility, and robust generalization to novel manipulation scenarios.

Claimed Contributions

VLM4VLA minimal adaptation pipeline

The authors propose a lightweight framework that adapts Vision-Language Models into Vision-Language-Action policies by adding new parameters amounting to less than 1% of the model. This design enables fair comparison across different VLMs while remaining competitive with more sophisticated architectures.

Retrieved papers: 10 (Can Refute)
Systematic empirical study of VLM capabilities for embodied control

The authors conduct large-scale experiments evaluating 17 VLMs across three benchmarks (CALVIN, SimplerEnv, LIBERO) to investigate how general VLM capabilities, embodied-specific fine-tuning, and vision encoder training strategies affect downstream manipulation performance. They reveal inconsistencies and gaps between VLM pretraining paradigms and the demands of embodied tasks.

Retrieved papers: 10
Analysis of vision encoder as critical bottleneck

Through ablation studies, the authors identify that fine-tuning the vision encoder is essential for VLA performance, showing significant degradation when the encoder is frozen. This finding highlights the importance of visual adaptation over simply scaling language-model parameters for embodied tasks.

Retrieved papers: 10 (Can Refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

VLM4VLA minimal adaptation pipeline

The authors propose a lightweight framework that adapts Vision-Language Models into Vision-Language-Action policies by adding new parameters amounting to less than 1% of the model. This design enables fair comparison across different VLMs while remaining competitive with more sophisticated architectures.
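To make the "less than 1% new parameters" budget concrete, the sketch below computes the fraction contributed by a small action head and a few learnable query embeddings on top of a large backbone. All sizes here are hypothetical round numbers for illustration, not figures taken from the VLM4VLA paper.

```python
# Illustrative parameter-budget check for a minimal VLM-to-VLA adaptation.
# Every number below is hypothetical, not taken from the paper.

def new_param_fraction(backbone_params: int, new_params: int) -> float:
    """Fraction of the full policy made up of newly added parameters."""
    return new_params / (backbone_params + new_params)

# Hypothetical 7B-parameter VLM backbone.
backbone = 7_000_000_000

# Hypothetical additions: a linear action head mapping a 4096-dim hidden
# state to a chunk of 8 timesteps of 7-DoF actions, plus 8 learnable
# action query embeddings.
action_head = 4096 * (7 * 8)       # linear head weight matrix
action_queries = 8 * 4096          # learnable query tokens
new = action_head + action_queries

frac = new_param_fraction(backbone, new)
print(f"new parameters: {new:,} ({frac:.4%} of the policy)")
```

Under these assumed sizes the additions are orders of magnitude below the 1% budget, which is what allows the backbone choice, rather than the adaptation machinery, to dominate any performance differences.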

Contribution

Systematic empirical study of VLM capabilities for embodied control

The authors conduct large-scale experiments evaluating 17 VLMs across three benchmarks (CALVIN, SimplerEnv, LIBERO) to investigate how general VLM capabilities, embodied-specific fine-tuning, and vision encoder training strategies affect downstream manipulation performance. They reveal inconsistencies and gaps between VLM pretraining paradigms and the demands of embodied tasks.
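The claim that general capabilities are "poor predictors" is the kind of statement one would test by rank-correlating a general VLM benchmark score with downstream manipulation success. The sketch below implements Spearman's rank correlation from scratch on synthetic scores for six hypothetical VLMs; the numbers are invented for illustration and carry no relation to the paper's results.

```python
# Rank-correlate a general VLM benchmark score with a downstream
# manipulation success rate. All data below is synthetic.

def rankdata(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(xs), rankdata(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Synthetic scores for six hypothetical VLMs.
general_benchmark = [62.1, 70.3, 55.4, 68.0, 74.2, 60.5]  # e.g. a VQA score
manip_success     = [0.48, 0.41, 0.55, 0.52, 0.44, 0.50]  # task success rate

rho = spearman(general_benchmark, manip_success)
print(f"Spearman rho = {rho:+.2f}")
```

A rho near zero (or, as in this synthetic example, negative) across benchmarks would be consistent with the paper's finding that general-capability rankings do not transfer to control performance.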

Contribution

Analysis of vision encoder as critical bottleneck

Through ablation studies, the authors identify that fine-tuning the vision encoder is essential for VLA performance, showing significant degradation when the encoder is frozen. This finding highlights the importance of visual adaptation over simply scaling language-model parameters for embodied tasks.
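The two training configurations being compared in such an ablation differ only in which modules the optimizer updates. The sketch below does the bookkeeping for "full fine-tune" versus "frozen vision encoder" using hypothetical module sizes (the names and counts are illustrative, not the paper's).

```python
# Trainable-parameter accounting for the encoder-freezing ablation.
# Module names and parameter counts are hypothetical.

MODULES = {
    "vision_encoder": 400_000_000,    # e.g. a ViT-L-scale encoder
    "llm_backbone":   7_000_000_000,  # large language-model backbone
    "action_head":          300_000,  # newly added adaptation parameters
}

def trainable_params(frozen: set) -> int:
    """Total parameters updated by the optimizer, given frozen module names."""
    return sum(n for name, n in MODULES.items() if name not in frozen)

full_ft   = trainable_params(frozen=set())
frozen_ve = trainable_params(frozen={"vision_encoder"})
print(f"full fine-tune:        {full_ft:,}")
print(f"frozen vision encoder: {frozen_ve:,}")
```

Note that the frozen-encoder run still updates the vast majority of the parameters, which is why a large performance drop in that setting points specifically at visual adaptation, not at overall capacity.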