Sim2Real VLA: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation
Overview
Overall Novelty Assessment
The paper proposes Sim2Real-VLA, a vision-language-action model trained exclusively on synthetic data for zero-shot real-world manipulation. It resides in the Vision-Language-Action Models leaf, which contains only two papers in the entire taxonomy of fifty works. This sparse population suggests VLA approaches remain relatively underexplored within the broader sim-to-real transfer landscape, where most research concentrates on domain randomization, rendering techniques, or reinforcement learning frameworks. The dual-system architecture—combining affordance-driven planning with tokenized low-level control—represents a structural departure from monolithic end-to-end VLA designs.
The taxonomy reveals that neighboring research directions emphasize different transfer mechanisms: domain randomization techniques randomize visual or physical parameters during training, while modular policy architectures decompose tasks hierarchically without language grounding. Foundation model-based planning leverages pretrained vision-language models for high-level reasoning but typically requires separate low-level controllers. Sim2Real-VLA bridges these paradigms by integrating language-conditioned planning with executable action primitives within a unified VLA framework, positioning itself at the intersection of policy learning and knowledge transfer branches rather than purely within simulation construction or adaptation categories.
Of the twenty-one candidates examined in total, ten were checked against the automated data generation contribution and three of them refuted it, indicating moderate prior work on synthetic data creation for manipulation. The object-oriented observation adaptation contribution faced two refutations among its ten candidates, suggesting existing methods already address domain randomization flows or visual adaptation strategies. The core dual-system architecture was compared against only one candidate, which did not clearly refute it, though the limited search scope prevents definitive claims about architectural novelty. The analysis explicitly covers top-K semantic matches and citation expansion rather than exhaustive field coverage, so additional relevant work may exist beyond this sample.
Given the restricted literature search and the VLA leaf's sparse population, the work appears to occupy a relatively novel position within its immediate taxonomy context. However, the contribution-level statistics reveal that specific technical components—particularly data generation and visual adaptation—have substantial precedent in adjacent research directions. The assessment reflects what twenty-one examined candidates reveal, acknowledging that a broader search might uncover additional overlapping methods in the rapidly evolving VLA and sim-to-real domains.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a VLA model with a dual-system architecture comprising a high-level planner that predicts chains of affordances and a low-level actor that executes these affordances using a tokenized action space. This design filters manipulation-irrelevant features and focuses on motion-critical dynamics to enable zero-shot Sim2Real transfer.
The authors develop an automated pipeline that generates training data for manipulation skills without manual intervention. This pipeline includes Real2Sim projection, generative scene scaling, and automatic skill acquisition, enabling scalable training exclusively from simulated data.
The authors introduce an object-oriented adaptation mechanism that recovers object masks from visual observations and applies strategic domain randomization flows across action-invariant features. This approach helps the model focus on task-relevant dynamics while filtering out manipulation-irrelevant variations.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[13] TinyVLA: Toward Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Contribution Analysis
Detailed comparisons for each claimed contribution
Sim2Real-VLA: Dual-system architecture with affordance-driven design
The authors propose a VLA model with a dual-system architecture comprising a high-level planner that predicts chains of affordances and a low-level actor that executes these affordances using a tokenized action space. This design filters manipulation-irrelevant features and focuses on motion-critical dynamics to enable zero-shot Sim2Real transfer.
[67] Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
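To make the claimed decomposition concrete, the minimal Python sketch below pairs a high-level planner, which maps an observation and instruction to a chain of affordances, with a low-level actor that emits discretized action tokens for each affordance. All names here (Affordance, HighLevelPlanner, LowLevelActor, detokenize) and the 256-bin action discretization are illustrative assumptions, not the paper's actual interfaces.

```python
# Hedged sketch of the dual-system decomposition; all names and the
# 256-bin action tokenization are assumptions for illustration only.
from dataclasses import dataclass

import numpy as np


@dataclass
class Affordance:
    """A mid-level subgoal, e.g. 'grasp handle' plus a target region."""
    label: str
    target_mask: np.ndarray  # binary mask over the image


class HighLevelPlanner:
    """High-level system: predicts a chain of affordances from image + instruction."""

    def plan(self, image: np.ndarray, instruction: str) -> list[Affordance]:
        raise NotImplementedError  # a vision-language model in the paper


class LowLevelActor:
    """Low-level system: decodes one affordance into discrete action tokens."""

    def act(self, image: np.ndarray, affordance: Affordance) -> list[int]:
        raise NotImplementedError  # an action-tokenized policy head


def detokenize(tokens: list[int], low=-1.0, high=1.0, num_bins=256) -> np.ndarray:
    """Map integer action tokens back to continuous commands (bin centers)."""
    return low + (np.asarray(tokens) + 0.5) * (high - low) / num_bins


def rollout(planner, actor, env, instruction: str):
    """Plan once, then execute each affordance with the actor.

    Assumes a gym-style env whose step() returns (obs, reward, done, info).
    """
    obs = env.reset()
    for affordance in planner.plan(obs, instruction):
        tokens = actor.act(obs, affordance)
        obs, _, done, _ = env.step(detokenize(tokens))
        if done:
            break
```

The single planning pass followed by per-affordance execution mirrors the chain-of-affordances framing; a real controller would presumably replan as observations change.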
Automated data generation pipeline for manipulation skills
The authors develop an automated pipeline that generates training data for manipulation skills without manual intervention. This pipeline includes Real2Sim projection, generative scene scaling, and automatic skill acquisition, enabling scalable training exclusively from simulated data; a minimal sketch of this flow appears after the comparison list below.
[54] DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning
[56] Generative Artificial Intelligence in Robotic Manipulation: A Survey
[59] RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
[51] 6DoF Assembly Pose Estimation Dataset for Robotic Manipulation
[52] Toward Synthetic Data Generation for Robotic Tactile Manipulations
[53] Is an Object-Centric Representation Beneficial for Robotic Manipulation?
[55] Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation
[57] Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
[58] MimicGen: A Data Generation System for Scalable Robot Learning Using Human Demonstrations
[60] SimLiquid: A Simulation-Based Liquid Perception Pipeline for Robot Liquid Manipulation
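For contrast with the generators listed above, here is a minimal sketch of the claimed three-stage pipeline. Every function name (real2sim_projection, generative_scene_scaling, automatic_skill_acquisition, generate_dataset) is a hypothetical placeholder inferred from the stage names in the text, not the paper's API.

```python
# Hedged sketch of the claimed three-stage data pipeline; all function
# names are placeholders inferred from the stage names in the text.

def real2sim_projection(rgbd_scan):
    """Stage 1: reconstruct simulatable assets (meshes, poses) from a real scan."""
    raise NotImplementedError


def generative_scene_scaling(base_scene, n_variants: int):
    """Stage 2: yield n_variants of the scene with varied layout and assets."""
    raise NotImplementedError


def automatic_skill_acquisition(scene):
    """Stage 3: synthesize a demonstration trajectory (e.g. via motion
    planning); return None if no feasible trajectory is found."""
    raise NotImplementedError


def generate_dataset(rgbd_scan, n_variants: int = 1000) -> list:
    """Run the full pipeline: one real scan in, many simulated demos out."""
    base_scene = real2sim_projection(rgbd_scan)
    dataset = []
    for scene in generative_scene_scaling(base_scene, n_variants):
        trajectory = automatic_skill_acquisition(scene)
        if trajectory is not None:  # keep only successful synthetic demos
            dataset.append(trajectory)
    return dataset
```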
Object-oriented observation adaptation with domain randomization flows
The authors introduce an object-oriented adaptation mechanism that recovers object masks from visual observations and applies strategic domain randomization flows across action-invariant features. This approach helps the model focus on task-relevant dynamics while filtering out manipulation-irrelevant variations.
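As a rough illustration of what randomizing action-invariant features could mean in practice, the sketch below takes a recovered object mask and replaces everything outside it with random texture, leaving task-relevant object pixels untouched. This is a simplified stand-in: the paper's domain randomization flows are presumably richer than per-pixel noise, and randomize_background is a hypothetical helper, not the authors' method.

```python
# Simplified stand-in for object-oriented observation adaptation:
# randomize appearance outside the object mask (action-invariant
# features) while preserving the task-relevant object pixels.
import numpy as np


def randomize_background(image: np.ndarray, object_mask: np.ndarray,
                         rng: np.random.Generator) -> np.ndarray:
    """image: (H, W, 3) uint8; object_mask: (H, W) bool, True on objects."""
    out = image.copy()
    background = ~object_mask
    noise = rng.integers(0, 256, size=image.shape, dtype=image.dtype)
    out[background] = noise[background]  # object pixels are left intact
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    mask = np.zeros((64, 64), dtype=bool)
    mask[16:48, 16:48] = True  # pretend this is a recovered object mask
    augmented = randomize_background(img, mask, rng)
    assert (augmented[mask] == img[mask]).all()  # object region preserved
```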