X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: robotics; vision-language-action model; prompt learning; heterogeneous pretraining
Abstract:

Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To leverage the heterogeneity of rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimal added parameters: we infuse prompt-learning concepts into cross-embodiment robot learning by introducing a separate set of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts that together enable VLA models to effectively exploit varying cross-embodiment features. The resulting X-VLA, a clean flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders with an enhanced encoding pipeline, combining scalability with simplicity. Evaluated across 6 simulation environments and 3 real-world robotic platforms, our 0.9B-parameter instantiation, X-VLA-0.9B, achieves state-of-the-art performance across a sweep of benchmark suites, demonstrating superior results along a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes X-VLA, a flow-matching-based VLA architecture that introduces soft prompts—learnable embeddings for each data source—to handle cross-embodiment heterogeneity. It resides in the 'Latent Action and Flow-Based VLA Models' leaf, which contains four papers total, including the original work. This leaf sits within the broader 'VLA Model Architectures and Training Paradigms' branch, a moderately populated area with five distinct subcategories. The soft prompt mechanism aims to enable effective exploitation of varying cross-embodiment features, positioning the work at the intersection of architectural innovation and cross-embodiment generalization.

The taxonomy reveals that cross-embodiment adaptation is addressed in a separate branch ('Cross-Embodiment Generalization and Adaptation'), which includes embodiment-specific adaptation mechanisms and equivariance approaches. X-VLA's soft prompts conceptually bridge these areas: while the architecture itself is flow-based (placing it in the current leaf), the prompt mechanism targets embodiment-specific adaptation. Neighboring leaves include 'End-to-End VLA Foundations' (five papers) and 'Hierarchical and Modular VLA Systems' (four papers), suggesting that latent/flow-based approaches represent a distinct but not isolated research direction within the broader VLA architecture landscape.

Among 30 candidates examined, the soft prompt mechanism (Contribution A) and X-VLA architecture (Contribution B) show no clear refutation across 10 candidates each. However, the two-phase training pipeline (Contribution C) encountered one refutable candidate among 10 examined, indicating some overlap with existing training strategies. The limited search scope means these findings reflect top-30 semantic matches rather than exhaustive coverage. The architectural contributions appear more distinctive than the training methodology, though the scale of examination (10 candidates per contribution) suggests caution in drawing strong conclusions about absolute novelty.

Based on the limited literature search, X-VLA's core architectural innovations—particularly the soft prompt mechanism for cross-embodiment learning—appear relatively novel within the examined candidate set. The training pipeline shows more overlap with prior work. The taxonomy context indicates this work occupies a moderately populated research direction, with the soft prompt approach potentially bridging architectural and adaptation-focused research streams in a way not extensively covered by the four sibling papers in its leaf.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: cross-embodiment vision-language-action policy learning. This field aims to train robotic policies that generalize across diverse robot morphologies, tasks, and environments by jointly modeling visual perception, natural language instructions, and action generation.

The taxonomy reflects a maturing landscape organized around several complementary themes. VLA Model Architectures and Training Paradigms explores foundational design choices, ranging from transformer-based architectures like RT-2[3] and OpenVLA[2] to latent action and flow-based formulations, while Cross-Embodiment Generalization and Adaptation addresses how policies transfer knowledge across different robot platforms. VLA Fine-Tuning and Specialization and VLA Training Enhancements focus on adapting pretrained models to specific domains or improving training efficiency, whereas Multimodal Representation Enhancement investigates richer sensory fusion strategies. Meanwhile, VLA Efficiency and Deployment Optimization tackles practical constraints such as model compression (e.g., TinyVLA[7]), and Task-Specific VLA Applications examines targeted use cases from manipulation to GUI interaction. Finally, VLA Evaluation, Benchmarking, and Analysis provides the empirical infrastructure needed to measure progress systematically.

Within this ecosystem, a particularly active line of work centers on latent action representations and flow-based modeling, where X-VLA[0] resides. These approaches encode actions in continuous latent spaces or leverage flow-based generative models to capture multimodal action distributions, contrasting with the discrete tokenization schemes prevalent in earlier VLA models. X-VLA[0] emphasizes learning expressive latent action policies that can handle complex, multi-step behaviors across varied embodiments, positioning it alongside methods like XR-1[27], Villa-X[29], and NORA[31], which similarly explore structured latent representations or hierarchical action generation.

Compared to more direct end-to-end architectures such as OpenVLA[2] or fine-tuning-centric studies like Fine-tuning VLA[1], X-VLA[0] prioritizes representational flexibility and generalization through its latent formulation. This design choice reflects broader tensions in the field: balancing expressiveness against sample efficiency, and achieving cross-embodiment transfer without sacrificing task-specific performance.

Claimed Contributions

Soft Prompt mechanism for cross-embodiment VLA training

The authors introduce a soft prompt mechanism that assigns learnable embeddings to each data source to capture embodiment-specific features. This approach addresses heterogeneity in cross-embodiment datasets by enabling the model to learn domain-specific hardware configurations with minimal parameter overhead. A minimal code sketch follows below.

10 retrieved papers
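
To make the mechanism concrete, here is a minimal PyTorch sketch of embodiment-specific soft prompts, assuming the prompts are prepended to the fused token sequence before a standard Transformer encoder. The class name, token counts, and dimensions (`SoftPromptBank`, `num_prompt_tokens`, 768-dim tokens) are illustrative assumptions, not the authors' implementation; only the idea of one learnable embedding set per data source comes from the paper.

```python
import torch
import torch.nn as nn

class SoftPromptBank(nn.Module):
    """One learnable prompt (a small set of embedding vectors) per data source.

    Hypothetical sketch: the paper states only that each distinct data source
    gets its own learnable embeddings; shapes and names here are assumptions.
    """

    def __init__(self, num_sources: int, num_prompt_tokens: int, dim: int):
        super().__init__()
        # Table of learnable prompts: (num_sources, num_prompt_tokens, dim).
        self.prompts = nn.Parameter(
            torch.randn(num_sources, num_prompt_tokens, dim) * 0.02
        )

    def forward(self, tokens: torch.Tensor, source_id: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); source_id: (batch,) integer source ids.
        prompt = self.prompts[source_id]            # (batch, num_prompt_tokens, dim)
        return torch.cat([prompt, tokens], dim=1)   # prepend embodiment prompt

# Usage: prepend source-specific prompts before a standard Transformer encoder.
bank = SoftPromptBank(num_sources=8, num_prompt_tokens=16, dim=768)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
tokens = torch.randn(4, 64, 768)          # fused vision/language/proprioception tokens
ids = torch.tensor([0, 0, 3, 5])          # which data source each sample comes from
out = encoder(bank(tokens, ids))          # (4, 16 + 64, 768)
```

Because only the small prompt table is new, the added parameter count stays negligible relative to the backbone, consistent with the claim of minimal parameter overhead.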
X-VLA architecture with soft-prompted Transformer encoders

The authors propose X-VLA, a flow-matching-based VLA architecture built on soft-prompted standard Transformer encoders. The architecture features an enhanced multimodal encoding pipeline that processes multi-view images, language prompts, and proprioceptive features, designed for scalable cross-embodiment training. A sketch of a flow-matching action head follows below.

10 retrieved papers
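
Since X-VLA is described as flow-matching-based, the sketch below shows a generic conditional flow-matching action head: a velocity field trained along linear noise-to-action interpolation paths and integrated with Euler steps at inference. This is the standard formulation under assumed names and shapes (`FlowMatchingActionHead`, a pooled 768-dim context), not necessarily the authors' exact parameterization.

```python
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    """Generic conditional flow matching for action chunks (hypothetical sketch).

    Learns a velocity field v_theta(x_t, t, context) along linear paths
    x_t = (1 - t) * x0 + t * x1, where x0 is noise, x1 a ground-truth action,
    and context the (assumed pooled) soft-prompted Transformer encoding.
    """

    def __init__(self, action_dim: int, context_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + context_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, action_dim),
        )

    def velocity(self, x_t, t, context):
        return self.net(torch.cat([x_t, context, t], dim=-1))

    def loss(self, x1, context):
        x0 = torch.randn_like(x1)                      # noise endpoint
        t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
        x_t = (1 - t) * x0 + t * x1                    # linear interpolation path
        target = x1 - x0                               # constant target velocity
        return ((self.velocity(x_t, t, context) - target) ** 2).mean()

    @torch.no_grad()
    def sample(self, context, action_dim, steps=10):
        x = torch.randn(context.shape[0], action_dim)  # start from noise
        for i in range(steps):                         # Euler integration to t = 1
            t = torch.full((context.shape[0], 1), i / steps)
            x = x + self.velocity(x, t, context) / steps
        return x

head = FlowMatchingActionHead(action_dim=7, context_dim=768)
ctx = torch.randn(4, 768)                      # pooled encoder output (assumed)
print(head.loss(torch.randn(4, 7), ctx))       # training objective
print(head.sample(ctx, action_dim=7).shape)    # torch.Size([4, 7])
```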
Two-phase training pipeline with customized learning recipe

The authors develop a two-phase training methodology comprising heterogeneous pretraining on mixed-embodiment data followed by domain-specific adaptation. The pipeline includes custom learning rates, aligned action representations, intention abstraction through temporal downsampling, and balanced data sampling strategies. A sketch of the two-phase setup follows below.

10 retrieved papers
Can Refute (1 paper)
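
As a rough illustration of how the two phases could be wired together, the sketch below pairs source-balanced sampling for heterogeneous pretraining with a second optimizer that learns a fresh soft prompt for the target embodiment faster than it updates the backbone, plus a stride-based temporal downsampling step. The phase structure, balanced sampling, custom learning rates, and downsampling idea come from the contribution description; every module name, dataset size, stride, and learning rate is an assumption.

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# All module names, dataset sizes, and learning rates below are assumptions.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2
)
soft_prompts = nn.Parameter(torch.randn(8, 16, 256) * 0.02)  # one per pretraining source

# Phase 1: heterogeneous pretraining. Weight each sample inversely to its
# source's size so every data source contributes equally in expectation.
dataset_sizes = {"robot_a": 120_000, "robot_b": 40_000, "robot_c": 8_000}
weights = [1.0 / n for n in dataset_sizes.values() for _ in range(n)]
sampler = WeightedRandomSampler(weights, num_samples=sum(dataset_sizes.values()))
phase1_opt = torch.optim.AdamW(list(backbone.parameters()) + [soft_prompts], lr=1e-4)

# Phase 2: domain-specific adaptation. A fresh prompt is trained for the target
# embodiment at a higher learning rate than the mostly converged backbone.
target_prompt = nn.Parameter(torch.randn(1, 16, 256) * 0.02)
phase2_opt = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},  # gentle backbone updates
        {"params": [target_prompt], "lr": 1e-4},        # fast prompt adaptation
    ]
)

# Intention abstraction via temporal downsampling (assumed stride of 4): keep
# every k-th step of an action sequence so the model predicts coarser intentions.
actions = torch.randn(4, 32, 7)       # (batch, horizon, action_dim), illustrative
coarse = actions[:, ::4, :]           # (4, 8, 7) after stride-4 downsampling
```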

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: Soft Prompt mechanism for cross-embodiment VLA training

Contribution B: X-VLA architecture with soft-prompted Transformer encoders

Contribution C: Two-phase training pipeline with customized learning recipe

The full description of each contribution appears under Claimed Contributions above.