VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
Overview
Overall Novelty Assessment
The paper introduces VLM4VLA, a minimal adaptation pipeline that converts general-purpose vision-language models into vision-language-action policies using a small set of learnable parameters. It resides in the VLM-to-VLA Adaptation Frameworks leaf, which contains only three papers total including this work. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 19 leaf nodes, suggesting the specific question of systematic VLM-to-VLA conversion remains underexplored compared to more crowded areas like full-scale VLA architectures or general manipulation task execution.
The taxonomy tree reveals that VLM-to-VLA adaptation sits within the larger VLA Model Architectures and Design branch, which also includes compact/efficient models and full-scale architectures. Neighboring leaves address efficiency-focused designs like TinyVLA and advanced reasoning systems with memory or multimodal integration. The scope note for this leaf explicitly excludes end-to-end VLA designs, positioning the work as a bridge between pretrained VLMs and robotic control rather than a novel architecture from scratch. Related directions in data generation and evaluation frameworks provide complementary infrastructure, but the core adaptation methodology remains distinct.
Across the 30 candidates examined (10 per contribution), the minimal adaptation pipeline overlaps with 3 of its 10 candidates, the systematic empirical study of VLM capabilities found no clear refutations among its 10, and the vision-encoder bottleneck analysis overlaps with 3 of its 10. Within this limited search scope, the empirical study therefore appears the most novel, whereas the adaptation pipeline and encoder analysis face more substantial prior work. These statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage of the field.
Based on the limited search scope of 30 candidates, the work's novelty appears mixed: the systematic empirical investigation of how VLM capabilities transfer to embodied control seems less explored, while the minimal adaptation approach and encoder bottleneck analysis encounter more prior work. The sparse population of the VLM-to-VLA adaptation leaf suggests room for contributions, though the analysis cannot rule out relevant work outside the top-30 semantic matches examined.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a lightweight framework that adapts Vision-Language Models into Vision-Language-Action policies by adding fewer than 1% new parameters. This design enables fair comparison across different VLMs while maintaining competitive performance with more sophisticated architectures.
The authors conduct large-scale experiments evaluating 17 VLMs across three benchmarks (CALVIN, SimplerEnv, LIBERO) to investigate how general VLM capabilities, embodied-specific fine-tuning, and vision encoder training strategies affect downstream manipulation performance. They reveal inconsistencies and gaps between VLM pretraining paradigms and embodied task demands.
Through ablation studies, the authors identify fine-tuning the vision encoder as essential for VLA performance: freezing the encoder causes significant degradation. This finding highlights the importance of visual adaptation over simply scaling language model parameters for embodied tasks.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[9] RoboUniView: Visual-language model with unified view representation for robotic manipulation PDF
[12] Vision-language foundation models as effective robot imitators PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
VLM4VLA minimal adaptation pipeline
The authors propose a lightweight framework that adapts Vision-Language Models into Vision-Language-Action policies by adding fewer than 1% new parameters. This design enables fair comparison across different VLMs while maintaining competitive performance with more sophisticated architectures.
[12] Vision-language foundation models as effective robot imitators PDF
[52] SmolVLA: A vision-language-action model for affordable and efficient robotics PDF
[56] VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model PDF
[1] TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation PDF
[11] RoboMamba: Efficient vision-language-action model for robotic reasoning and manipulation PDF
[44] Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey PDF
[51] BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation PDF
[53] 3DS-VLA: A 3D spatial-aware vision language action model for robust multi-task manipulation PDF
[54] What Matters in Employing Vision Language Models for Tokenizing Actions in Robot Control? PDF
[55] A survey on efficient vision-language-action models PDF
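The claimed sub-1% parameter budget of the adaptation pipeline can be sanity-checked with simple arithmetic. The backbone size and head dimensions below are illustrative assumptions for a back-of-the-envelope check, not the paper's actual configuration.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a single dense layer."""
    return d_in * d_out + (d_out if bias else 0)

VLM_PARAMS = 7_000_000_000  # assumed 7B-parameter VLM backbone

# A minimal action head: one hidden projection plus an output layer
# mapping the LM hidden state to an 8-step chunk of 7-DoF actions.
hidden, action_dim, chunk = 4096, 7, 8
new_params = linear_params(hidden, hidden) + linear_params(hidden, action_dim * chunk)

fraction = new_params / (VLM_PARAMS + new_params)
print(f"added parameters: {new_params:,} = {fraction:.3%} of the adapted model")
assert fraction < 0.01  # consistent with the "fewer than 1%" claim
```

Even a full hidden-to-hidden projection plus an output layer stays well under the 1% threshold at this assumed backbone scale, which illustrates why such adaptations permit fair cross-VLM comparison: the backbone dominates the parameter count.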
Systematic empirical study of VLM capabilities for embodied control
The authors conduct large-scale experiments evaluating 17 VLMs across three benchmarks (CALVIN, SimplerEnv, LIBERO) to investigate how general VLM capabilities, embodied-specific fine-tuning, and vision encoder training strategies affect downstream manipulation performance. They reveal inconsistencies and gaps between VLM pretraining paradigms and embodied task demands.
[1] TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation PDF
[5] RoboPoint: A vision-language model for spatial affordance prediction for robotics PDF
[20] ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation PDF
[44] Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey PDF
[65] Physically grounded vision-language models for robotic manipulation PDF
[66] Survey of vision-language-action models for embodied manipulation PDF
[67] EmbodiedGPT: Vision-language pre-training via embodied chain of thought PDF
[68] VLP: Vision-Language Preference Learning for Embodied Manipulation PDF
[69] MoManipVLA: Transferring vision-language-action models for general mobile manipulation PDF
[70] A survey on vision-language-action models for embodied AI PDF
Analysis of vision encoder as critical bottleneck
Through ablation studies, the authors identify fine-tuning the vision encoder as essential for VLA performance: freezing the encoder causes significant degradation. This finding highlights the importance of visual adaptation over simply scaling language model parameters for embodied tasks.
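The frozen- versus fine-tuned-encoder ablation described above can be sketched in PyTorch. The toy module sizes and names (`encoder`, `head`) are illustrative assumptions, not the paper's architecture; the point is only how the two training configurations differ in which parameters receive gradients.

```python
import torch.nn as nn

# Toy stand-ins: a pretrained vision encoder and a newly added action head.
encoder = nn.Linear(512, 256)  # stands in for the VLM's vision tower
head = nn.Linear(256, 7)       # small action head (always trained)

def set_encoder_trainable(trainable: bool) -> None:
    """Toggle between the frozen- and fine-tuned-encoder configurations."""
    for p in encoder.parameters():
        p.requires_grad_(trainable)

def trainable_params() -> int:
    """Count parameters that would receive gradient updates."""
    params = list(encoder.parameters()) + list(head.parameters())
    return sum(p.numel() for p in params if p.requires_grad)

set_encoder_trainable(False)       # frozen-encoder ablation
frozen_count = trainable_params()  # only the head's parameters train
set_encoder_trainable(True)        # full fine-tuning of the encoder
full_count = trainable_params()
print(frozen_count, full_count)
```

In a real experiment, an optimizer built from only the `requires_grad` parameters would then be used for each configuration, and downstream manipulation success rates would be compared between the two runs.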