Sim2Real VLA: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Zero-Shot Sim2Real · Vision-Language-Action (VLA) Model · Long-horizon Manipulation
Abstract:

Vision-Language-Action (VLA) models represent a critical milestone toward embodied intelligence in robotic manipulation. To support their training, recent research has developed high-performance simulation engines for data synthesis. However, the effectiveness of such synthetic data is still significantly limited by the simulation-to-reality (Sim2Real) gap: policies trained on it often fail to generalize reliably to the real world. To address this challenge, we present Sim2Real-VLA, a generalist robot control model trained exclusively on synthetic data, yet capable of transferring seamlessly to real-world manipulation tasks. Sim2Real-VLA features a dual-system architecture: a high-level planner that infers object-centered chains of affordances, and a low-level actor that executes and validates these plans in real time via a tokenized action space. This design filters out manipulation-irrelevant features and prioritizes motion-critical dynamics, thereby enhancing Sim2Real domain transfer. Moreover, a notable advantage of Sim2Real-VLA lies in its tight integration with automated data generation for manipulation skills, eliminating the need for manual fine-tuning and enabling scalable, hands-free training. Empirical evaluations across bimanual, dexterous, and long-horizon tasks show that Sim2Real-VLA consistently outperforms previous VLA baselines across diverse real-world environments and domain shifts.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Sim2Real-VLA, a vision-language-action model trained exclusively on synthetic data for zero-shot real-world manipulation. It resides in the Vision-Language-Action Models leaf, which contains only two papers in the entire taxonomy of fifty works. This sparse population suggests VLA approaches remain relatively underexplored within the broader sim-to-real transfer landscape, where most research concentrates on domain randomization, rendering techniques, or reinforcement learning frameworks. The dual-system architecture—combining affordance-driven planning with tokenized low-level control—represents a structural departure from monolithic end-to-end VLA designs.

The taxonomy reveals that neighboring research directions emphasize different transfer mechanisms: domain randomization techniques randomize visual or physical parameters during training, while modular policy architectures decompose tasks hierarchically without language grounding. Foundation model-based planning leverages pretrained vision-language models for high-level reasoning but typically requires separate low-level controllers. Sim2Real-VLA bridges these paradigms by integrating language-conditioned planning with executable action primitives within a unified VLA framework, positioning itself at the intersection of policy learning and knowledge transfer branches rather than purely within simulation construction or adaptation categories.

Among the twenty-one candidate papers examined, the automated data generation pipeline was refuted by three of its ten candidates, indicating moderate prior work on synthetic data creation for manipulation. The object-oriented observation adaptation contribution was refuted by two of its ten candidates, suggesting that existing methods already address domain randomization flows or visual adaptation strategies. The core dual-system architecture was compared against only one candidate, with no clear refutation, though the limited search scope prevents definitive claims about architectural novelty. The analysis covers only top-K semantic matches and citation expansion, not exhaustive field coverage, so additional relevant work may exist beyond this sample.

Given the restricted literature search and the VLA leaf's sparse population, the work appears to occupy a relatively novel position within its immediate taxonomy context. However, the contribution-level statistics reveal that specific technical components—particularly data generation and visual adaptation—have substantial precedent in adjacent research directions. The assessment reflects what twenty-one examined candidates reveal, acknowledging that a broader search might uncover additional overlapping methods in the rapidly evolving VLA and sim-to-real domains.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 5

Research Landscape Overview

Core task: zero-shot simulation to reality transfer for robotic manipulation. The field is organized around several complementary branches that together address the challenge of training policies in simulation and deploying them directly on real robots without further real-world data collection.

Simulation Environment Construction and Rendering focuses on building high-fidelity virtual worlds, often leveraging photorealistic rendering or neural scene representations such as Gaussian splatting (e.g., Splatsim[14], GSWorld[5]). Domain Randomization and Adaptation Techniques explore strategies to vary simulation parameters—lighting, textures, dynamics—so that policies become robust to distributional shifts. Policy Learning and Transfer Frameworks encompass the algorithmic side, including reinforcement learning pipelines, imitation learning from demonstrations, and increasingly vision-language-action models that ground language instructions in visuomotor control. Specialized Manipulation Domains and Modalities target specific settings like dexterous manipulation or tactile sensing, while Knowledge Transfer and Generalization Mechanisms investigate how learned representations or modular skills can generalize across tasks. System Identification and Real-World Adaptation methods refine simulator parameters or adapt policies using minimal real-world feedback, and Task-Specific Applications and Benchmarks provide standardized testbeds for evaluating transfer success.

Recent work has seen a surge in vision-language-action (VLA) models that unify perception, language understanding, and action prediction within a single architecture, aiming for broad generalization across diverse manipulation tasks. Sim2Real VLA[0] exemplifies this trend by training a VLA model entirely in simulation and achieving zero-shot transfer to real-world scenarios, closely aligning with efforts like TinyVLA[13] that explore efficient VLA architectures.
In contrast, other lines of research emphasize domain randomization (e.g., Robust visual sim-to-real transfer[10]) or high-fidelity rendering pipelines (High-fidelity simulated data generation[6]) to bridge the reality gap without relying on large-scale language grounding. A key open question is whether VLA models can maintain their generalization advantages when faced with complex contact-rich tasks or whether hybrid approaches combining randomization, realistic rendering, and language grounding will prove more robust. Sim2Real VLA[0] sits within the VLA branch, sharing the ambition of RAM[3] and Zero-Shot Visual Generalization in[2] to leverage pre-trained vision-language priors, yet it distinguishes itself by focusing on pure simulation-based training without any real-world fine-tuning, a stricter zero-shot regime than many contemporaneous efforts.

Claimed Contributions

Sim2Real-VLA: Dual-system architecture with affordance-driven design

The authors propose a VLA model with a dual-system architecture comprising a high-level planner that predicts chains of affordances and a low-level actor that executes these affordances using a tokenized action space. This design filters manipulation-irrelevant features and focuses on motion-critical dynamics to enable zero-shot Sim2Real transfer.

1 retrieved paper
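The dual-system control loop described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the class names, the fixed chain-of-affordances, and the toy action-token vocabulary are all assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Affordance:
    """One object-centered step, e.g. ('grasp', 'mug_handle')."""
    verb: str
    target: str

class HighLevelPlanner:
    """Stand-in for the high-level planner that infers a chain of affordances."""
    def plan(self, instruction: str) -> List[Affordance]:
        # A real planner would condition on the observation and instruction;
        # here a fixed plan is returned purely for illustration.
        return [Affordance("grasp", "mug_handle"),
                Affordance("place", "shelf")]

class LowLevelActor:
    """Stand-in for the low-level actor with a tokenized action space."""
    VOCAB = {"grasp": [101, 7, 42], "place": [102, 3, 99]}  # toy action tokens

    def execute(self, aff: Affordance) -> List[int]:
        # Decode the affordance into discrete action tokens; a real system
        # would also validate execution against sensor feedback.
        return self.VOCAB[aff.verb]

def run_episode(instruction: str) -> List[int]:
    planner, actor = HighLevelPlanner(), LowLevelActor()
    tokens: List[int] = []
    for aff in planner.plan(instruction):
        tokens.extend(actor.execute(aff))
    return tokens

print(run_episode("put the mug on the shelf"))  # → [101, 7, 42, 102, 3, 99]
```

The point of the split is that only the planner needs semantic, language-grounded reasoning, while the actor operates over a compact discrete token space, which is the property the report credits with filtering manipulation-irrelevant features.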

Automated data generation pipeline for manipulation skills

The authors develop an automated pipeline that generates training data for manipulation skills without manual intervention. This pipeline includes Real2Sim projection, generative scene scaling, and automatic skill acquisition, enabling scalable training exclusively from simulated data.

10 retrieved papers
Can Refute
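The three pipeline stages named in this contribution (Real2Sim projection, generative scene scaling, automatic skill acquisition) compose naturally as a data-generation chain. The sketch below is hypothetical: every function body is a placeholder standing in for the corresponding stage, not the authors' pipeline.

```python
import random

def real2sim_projection(scan: dict) -> dict:
    """Project a real-world scan into a simulatable scene description."""
    return {"objects": scan["objects"], "source": "real2sim"}

def generative_scene_scaling(scene: dict, n_variants: int, seed: int = 0) -> list:
    """Produce n randomized variants of the base scene (layout seeds only here)."""
    rng = random.Random(seed)
    return [{**scene, "layout_seed": rng.randrange(10**6)} for _ in range(n_variants)]

def automatic_skill_acquisition(scene: dict) -> dict:
    """Stand-in for motion planning / RL that yields one demonstration per scene."""
    return {"scene": scene, "trajectory": ["reach", "grasp", "lift"]}

def generate_dataset(scan: dict, n_variants: int = 3) -> list:
    # Chain the three stages: project, scale, then acquire skills per variant.
    base = real2sim_projection(scan)
    scenes = generative_scene_scaling(base, n_variants)
    return [automatic_skill_acquisition(s) for s in scenes]

demos = generate_dataset({"objects": ["mug", "shelf"]})
print(len(demos))  # → 3
```

The "hands-free" claim amounts to this chain having no manual step: each stage consumes the previous stage's output, so scaling the dataset is just raising `n_variants`.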

Object-oriented observation adaptation with domain randomization flows

The authors introduce an object-oriented adaptation mechanism that recovers object masks from visual observations and applies strategic domain randomization flows across action-invariant features. This approach helps the model focus on task-relevant dynamics while filtering out manipulation-irrelevant variations.

10 retrieved papers
Can Refute
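The adaptation mechanism described here (recover an object mask, then randomize only the action-invariant features) can be illustrated with a toy example. A minimal sketch, assuming a crude brightness threshold as a stand-in for a learned segmentation model and random textures as the "randomization flow":

```python
import numpy as np

def recover_object_mask(image: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Crude stand-in for segmentation: treat bright pixels as the object."""
    return image.mean(axis=-1) > threshold

def randomize_background(image: np.ndarray, mask: np.ndarray,
                         rng: np.random.Generator) -> np.ndarray:
    """Replace action-invariant background pixels with random texture
    while leaving task-relevant object pixels untouched."""
    out = rng.random(image.shape)   # random texture everywhere
    out[mask] = image[mask]         # restore the object pixels
    return out

rng = np.random.default_rng(0)
img = np.zeros((4, 4, 3))
img[1:3, 1:3] = 1.0                 # bright 2x2 "object" on dark background
mask = recover_object_mask(img)
aug = randomize_background(img, mask, rng)
assert np.allclose(aug[mask], img[mask])  # object appearance is preserved
```

Training on many such augmented views forces the policy to ignore background statistics, which is the "focus on task-relevant dynamics" effect the contribution claims.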

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Sim2Real-VLA: Dual-system architecture with affordance-driven design


Contribution

Automated data generation pipeline for manipulation skills


Contribution

Object-oriented observation adaptation with domain randomization flows
