X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: robotics; vision-language-action model; prompt learning; heterogeneous pretraining
Abstract:

Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To leverage the heterogeneity of rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimal added parameters: we infuse prompt-learning concepts into cross-embodiment robot learning by introducing a separate set of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts that together enable VLA models to effectively exploit varying cross-embodiment features. The resulting X-VLA, a clean flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders with an enhanced encoding pipeline, combining scalability with simplicity. Evaluated across 6 simulation environments and 3 real-world robotic platforms, our 0.9B-parameter instantiation, X-VLA-0.9B, achieves state-of-the-art performance across a sweep of benchmark suites, demonstrating superior results along a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes X-VLA, a flow-matching-based VLA architecture that introduces soft prompts—learnable embeddings for each data source—to handle cross-embodiment heterogeneity. It resides in the 'Latent Action and Flow-Based VLA Models' leaf, which contains four papers total, including the original work. This leaf sits within the broader 'VLA Model Architectures and Training Paradigms' branch, a moderately populated area with five distinct subcategories. The soft prompt mechanism aims to enable effective exploitation of varying cross-embodiment features, positioning the work at the intersection of architectural innovation and cross-embodiment generalization.

The taxonomy reveals that cross-embodiment adaptation is addressed in a separate branch ('Cross-Embodiment Generalization and Adaptation'), which includes embodiment-specific adaptation mechanisms and equivariance approaches. X-VLA's soft prompts conceptually bridge these areas: while the architecture itself is flow-based (placing it in the current leaf), the prompt mechanism targets embodiment-specific adaptation. Neighboring leaves include 'End-to-End VLA Foundations' (five papers) and 'Hierarchical and Modular VLA Systems' (four papers), suggesting that latent/flow-based approaches represent a distinct but not isolated research direction within the broader VLA architecture landscape.

Among 30 candidates examined, the soft prompt mechanism (Contribution A) and X-VLA architecture (Contribution B) show no clear refutation across 10 candidates each. However, the two-phase training pipeline (Contribution C) encountered one refutable candidate among 10 examined, indicating some overlap with existing training strategies. The limited search scope means these findings reflect top-30 semantic matches rather than exhaustive coverage. The architectural contributions appear more distinctive than the training methodology, though the scale of examination (10 candidates per contribution) suggests caution in drawing strong conclusions about absolute novelty.

Based on the limited literature search, X-VLA's core architectural innovations—particularly the soft prompt mechanism for cross-embodiment learning—appear relatively novel within the examined candidate set. The training pipeline shows more overlap with prior work. The taxonomy context indicates this work occupies a moderately populated research direction, with the soft prompt approach potentially bridging architectural and adaptation-focused research streams in a way not extensively covered by the four sibling papers in its leaf.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: cross-embodiment vision-language-action policy learning. This field aims to train robotic policies that generalize across diverse robot morphologies, tasks, and environments by jointly modeling visual perception, natural language instructions, and action generation.

The taxonomy reflects a maturing landscape organized around several complementary themes. VLA Model Architectures and Training Paradigms explores foundational design choices, ranging from transformer-based architectures like RT-2[3] and OpenVLA[2] to latent action and flow-based formulations, while Cross-Embodiment Generalization and Adaptation addresses how policies transfer knowledge across different robot platforms. VLA Fine-Tuning and Specialization and VLA Training Enhancements focus on adapting pretrained models to specific domains or improving training efficiency, whereas Multimodal Representation Enhancement investigates richer sensory fusion strategies. Meanwhile, VLA Efficiency and Deployment Optimization tackles practical constraints such as model compression (e.g., TinyVLA[7]), and Task-Specific VLA Applications examines targeted use cases from manipulation to GUI interaction. Finally, VLA Evaluation, Benchmarking, and Analysis provides the empirical infrastructure needed to measure progress systematically.

Within this ecosystem, a particularly active line of work centers on latent action representations and flow-based modeling, where X-VLA[0] resides. These approaches encode actions in continuous latent spaces or leverage flow-based generative models to capture multimodal action distributions, contrasting with the discrete tokenization schemes prevalent in earlier VLA models. X-VLA[0] emphasizes learning expressive latent action policies that can handle complex, multi-step behaviors across varied embodiments, positioning it alongside methods like XR-1[27], Villa-X[29], and NORA[31], which similarly explore structured latent representations or hierarchical action generation.

Compared to more direct end-to-end architectures such as OpenVLA[2] or fine-tuning-centric studies like Fine-tuning VLA[1], X-VLA[0] prioritizes representational flexibility and generalization through its latent formulation. This design choice reflects broader tensions in the field: balancing expressiveness against sample efficiency, and achieving cross-embodiment transfer without sacrificing task-specific performance.

Claimed Contributions

Soft Prompt mechanism for cross-embodiment VLA training

The authors introduce a soft prompt mechanism that assigns learnable embeddings to each data source to capture embodiment-specific features. This approach addresses heterogeneity in cross-embodiment datasets by enabling the model to learn domain-specific hardware configurations with minimal parameter overhead. A minimal code sketch follows below.

10 retrieved papers
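
To make the mechanism concrete, here is a minimal PyTorch sketch of embodiment-specific soft prompts, assuming the prompts are prepended to the fused token sequence before a standard Transformer encoder. The class name, token counts, and dimensions (`SoftPromptBank`, `num_prompt_tokens`, 768-dim tokens) are illustrative assumptions, not the authors' implementation; only the idea of one learnable embedding set per data source comes from the paper.

```python
import torch
import torch.nn as nn

class SoftPromptBank(nn.Module):
    """One learnable prompt (a small set of embedding vectors) per data source.

    Hypothetical sketch: the paper states only that each distinct data source
    gets its own learnable embeddings; shapes and names here are assumptions.
    """

    def __init__(self, num_sources: int, num_prompt_tokens: int, dim: int):
        super().__init__()
        # Table of learnable prompts: (num_sources, num_prompt_tokens, dim).
        self.prompts = nn.Parameter(
            torch.randn(num_sources, num_prompt_tokens, dim) * 0.02
        )

    def forward(self, tokens: torch.Tensor, source_id: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); source_id: (batch,) integer source ids.
        prompt = self.prompts[source_id]            # (batch, num_prompt_tokens, dim)
        return torch.cat([prompt, tokens], dim=1)   # prepend embodiment prompt

# Usage: prepend source-specific prompts before a standard Transformer encoder.
bank = SoftPromptBank(num_sources=8, num_prompt_tokens=16, dim=768)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
tokens = torch.randn(4, 64, 768)          # fused vision/language/proprioception tokens
ids = torch.tensor([0, 0, 3, 5])          # which data source each sample comes from
out = encoder(bank(tokens, ids))          # (4, 16 + 64, 768)
```

Because only the small prompt table is new, the added parameter count stays negligible relative to the backbone, consistent with the claim of minimal parameter overhead.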
X-VLA architecture with soft-prompted Transformer encoders

The authors propose X-VLA, a flow-matching-based VLA architecture built on soft-prompted standard Transformer encoders. The architecture features an enhanced multimodal encoding pipeline that processes multi-view images, language prompts, and proprioceptive features, designed for scalable cross-embodiment training. A sketch of a flow-matching action head follows below.

10 retrieved papers
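
Since X-VLA is described as flow-matching-based, the sketch below shows a generic conditional flow-matching action head: a velocity field trained along linear noise-to-action interpolation paths and integrated with Euler steps at inference. This is the standard formulation under assumed names and shapes (`FlowMatchingActionHead`, a pooled 768-dim context), not necessarily the authors' exact parameterization.

```python
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    """Generic conditional flow matching for action chunks (hypothetical sketch).

    Learns a velocity field v_theta(x_t, t, context) along linear paths
    x_t = (1 - t) * x0 + t * x1, where x0 is noise, x1 a ground-truth action,
    and context the (assumed pooled) soft-prompted Transformer encoding.
    """

    def __init__(self, action_dim: int, context_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + context_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, action_dim),
        )

    def velocity(self, x_t, t, context):
        return self.net(torch.cat([x_t, context, t], dim=-1))

    def loss(self, x1, context):
        x0 = torch.randn_like(x1)                      # noise endpoint
        t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
        x_t = (1 - t) * x0 + t * x1                    # linear interpolation path
        target = x1 - x0                               # constant target velocity
        return ((self.velocity(x_t, t, context) - target) ** 2).mean()

    @torch.no_grad()
    def sample(self, context, action_dim, steps=10):
        x = torch.randn(context.shape[0], action_dim)  # start from noise
        for i in range(steps):                         # Euler integration to t = 1
            t = torch.full((context.shape[0], 1), i / steps)
            x = x + self.velocity(x, t, context) / steps
        return x

head = FlowMatchingActionHead(action_dim=7, context_dim=768)
ctx = torch.randn(4, 768)                      # pooled encoder output (assumed)
print(head.loss(torch.randn(4, 7), ctx))       # training objective
print(head.sample(ctx, action_dim=7).shape)    # torch.Size([4, 7])
```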
Two-phase training pipeline with customized learning recipe

The authors develop a two-phase training methodology comprising heterogeneous pretraining on mixed-embodiment data followed by domain-specific adaptation. The pipeline includes custom learning rates, aligned action representations, intention abstraction through temporal downsampling, and balanced data sampling strategies. A sketch of the two-phase setup follows below.

10 retrieved papers
Can Refute (1 paper)
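
As a rough illustration of how the two phases could be wired together, the sketch below pairs source-balanced sampling for heterogeneous pretraining with a second optimizer that learns a fresh soft prompt for the target embodiment faster than it updates the backbone, plus a stride-based temporal downsampling step. The phase structure, balanced sampling, custom learning rates, and downsampling idea come from the contribution description; every module name, dataset size, stride, and learning rate is an assumption.

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# All module names, dataset sizes, and learning rates below are assumptions.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2
)
soft_prompts = nn.Parameter(torch.randn(8, 16, 256) * 0.02)  # one per pretraining source

# Phase 1: heterogeneous pretraining. Weight each sample inversely to its
# source's size so every data source contributes equally in expectation.
dataset_sizes = {"robot_a": 120_000, "robot_b": 40_000, "robot_c": 8_000}
weights = [1.0 / n for n in dataset_sizes.values() for _ in range(n)]
sampler = WeightedRandomSampler(weights, num_samples=sum(dataset_sizes.values()))
phase1_opt = torch.optim.AdamW(list(backbone.parameters()) + [soft_prompts], lr=1e-4)

# Phase 2: domain-specific adaptation. A fresh prompt is trained for the target
# embodiment at a higher learning rate than the mostly converged backbone.
target_prompt = nn.Parameter(torch.randn(1, 16, 256) * 0.02)
phase2_opt = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-5},  # gentle backbone updates
        {"params": [target_prompt], "lr": 1e-4},        # fast prompt adaptation
    ]
)

# Intention abstraction via temporal downsampling (assumed stride of 4): keep
# every k-th step of an action sequence so the model predicts coarser intentions.
actions = torch.randn(4, 32, 7)       # (batch, horizon, action_dim), illustrative
coarse = actions[:, ::4, :]           # (4, 8, 7) after stride-4 downsampling
```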

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: Soft Prompt mechanism for cross-embodiment VLA training

Contribution B: X-VLA architecture with soft-prompted Transformer encoders

Contribution C: Two-phase training pipeline with customized learning recipe

The full description of each contribution appears under Claimed Contributions above.