Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: World Action Model; Embodied AI; Vision-Language-Action; Robotic Manipulation
Abstract:

We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that jointly learns visual representations and action policies within a single video-generative framework. At its core, GE-Base is a large-scale instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Building on this foundation, GE-Act employs a lightweight flow-matching decoder to map latent representations into executable action trajectories, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. Trained on over 1 million manipulation episodes, GE supports both short- and long-horizon tasks and generalizes across embodiments. All code, models, and benchmarks will be released publicly.
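The abstract describes a two-stage pipeline: GE-Base turns observations and an instruction into video-centric latents, and GE-Act decodes those latents into an executable action chunk. The following is a minimal sketch of that wiring under assumed names and shapes (GEBaseStub, GEActStub, toy encoders); it is illustrative, not the authors' implementation.

```python
# Hypothetical two-stage GE pipeline: video-latent world model -> action decoder.
# Everything here (names, shapes, toy encoders) is an assumption for illustration.
import torch
import torch.nn as nn

class GEBaseStub(nn.Module):
    """Stand-in for the instruction-conditioned video diffusion backbone."""
    def __init__(self, latent_dim=256, vocab=1000):
        super().__init__()
        self.frame_proj = nn.Linear(3 * 64 * 64, latent_dim)  # toy frame encoder
        self.text_embed = nn.Embedding(vocab, latent_dim)     # toy instruction encoder

    def encode(self, frames, instruction_ids):
        # frames: (T, 3, 64, 64); instruction_ids: (L,) token ids
        z_vis = self.frame_proj(frames.flatten(1))            # (T, latent_dim)
        z_txt = self.text_embed(instruction_ids).mean(0)      # (latent_dim,)
        return z_vis + z_txt                                  # fused video-language latents

class GEActStub(nn.Module):
    """Stand-in for the lightweight flow-matching action decoder."""
    def __init__(self, latent_dim=256, action_dim=7):
        super().__init__()
        self.velocity = nn.Linear(latent_dim + action_dim, action_dim)

    @torch.no_grad()
    def decode(self, latents, horizon=16, steps=8):
        actions = torch.zeros(horizon, self.velocity.out_features)  # start of the flow
        ctx = latents.mean(0, keepdim=True).expand(horizon, -1)     # pooled world context
        for _ in range(steps):                                      # crude Euler integration
            v = self.velocity(torch.cat([ctx, actions], dim=-1))
            actions = actions + v / steps
        return actions                                              # (horizon, action_dim)

frames = torch.randn(8, 3, 64, 64)            # 8 observed frames
instruction = torch.tensor([3, 14, 159])      # toy token ids
chunk = GEActStub().decode(GEBaseStub().encode(frames, instruction))
print(chunk.shape)  # torch.Size([16, 7])
```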

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Genie Envisioner proposes a unified world foundation platform combining instruction-conditioned video diffusion (GE-Base) with a flow-matching action decoder (GE-Act) for robotic manipulation. The paper resides in the Diffusion-Based Video Generation leaf, which, with thirteen papers, is the most populated branch in the taxonomy. This crowded research direction reflects intense activity in applying diffusion models to robotic video synthesis, with sibling works like TrackDiffusion and RoboEnvision exploring similar architectural paradigms. The high density suggests that diffusion-based approaches have become a dominant framework for video-generative world modeling in manipulation contexts.

The taxonomy reveals neighboring branches addressing complementary challenges: Autoregressive and Flow-Based Generation (two papers) explores alternative generative mechanisms, while Large-Scale Pre-Training and Foundation Models (three papers) emphasizes scaling strategies. Geometric-Aware and 3D-Consistent Modeling branches (nine papers across three leaves) focus on spatial reasoning that diffusion-based methods often lack. Genie Envisioner bridges these directions by combining diffusion-based video synthesis with flow-matching for action decoding, positioning itself at the intersection of generative architectures and policy learning. The taxonomy's scope notes clarify that this leaf excludes downstream policy extraction (covered under Policy Learning) and geometric reconstruction (under Geometric-Aware Modeling).

Among thirty candidates examined, the analysis found limited overlap with prior work. The unified platform contribution (Contribution A) and the instruction-conditioned diffusion model (Contribution B) were each compared against ten candidates, with zero refutable matches, suggesting relative novelty within the search scope. The flow-matching action decoder (Contribution C) yielded one refutable candidate among the ten examined, indicating some precedent for parallel action prediction architectures. These statistics reflect a focused semantic search rather than exhaustive coverage; the absence of refutations does not guarantee absolute novelty, but it suggests the work occupies a less-explored niche within the examined literature.

Based on the limited search scope of thirty semantically similar papers, Genie Envisioner appears to introduce a distinctive integration of instruction-conditioned diffusion and flow-matching action decoding. The crowded taxonomy leaf indicates a competitive research area, yet the low refutation rate suggests the specific architectural combination and unified platform framing may offer incremental differentiation. The analysis does not cover broader policy learning literature or recent preprints outside the top-thirty matches, leaving open questions about overlap with concurrent foundation model efforts.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: robotic manipulation through video-generative world modeling. The field centers on leveraging generative video models to predict future visual states and guide robotic policies, bridging perception and action in manipulation tasks.

The taxonomy reveals several complementary branches:

- Video Generation Architectures and Training Paradigms explores foundational techniques such as diffusion-based methods (e.g., Genie Envisioner[0], TrackDiffusion[6]) and autoregressive approaches.
- Geometric-Aware and 3D-Consistent Modeling emphasizes spatial reasoning and multi-view consistency (e.g., ManiGaussian[4], Geometry-aware Video Generation[7]).
- Controllability Mechanisms and Conditioning addresses how to steer generated videos via actions, trajectories, or language.
- Policy Learning from Generated Videos investigates how to extract executable behaviors from synthetic rollouts (e.g., Imitating Generated Videos[12], Video2policy[14]).
- Data Generation and Augmentation focuses on scaling training datasets.
- Evaluation and Benchmarking provides metrics and testbeds (e.g., WorldSimBench[1]).
- Survey and Conceptual Frameworks offer high-level perspectives.

Together, these branches form a pipeline from video synthesis to policy deployment, with ongoing interplay between generative quality, physical plausibility, and downstream task performance. Recent work highlights contrasts between purely visual generation and geometry-grounded modeling, as well as trade-offs between model expressiveness and computational cost.

Diffusion-based approaches like Genie Envisioner[0] and RoboEnvision[8] prioritize high-fidelity video synthesis and flexible conditioning, often enabling rich action-conditioned rollouts that can inform policy learning. In contrast, methods such as ManipDreamer3D[16] and Pretrained Video Simulators[9] emphasize 3D consistency or leverage large-scale pretraining to improve generalization. Genie Envisioner[0] sits within the diffusion-based video generation cluster, sharing architectural themes with TrackDiffusion[6] and RoboEnvision[8], yet it distinguishes itself by integrating trajectory-level control and targeting manipulation-specific scenarios. Compared to Collaborative Trajectory Control[13], which focuses on multi-agent coordination, Genie Envisioner[0] emphasizes single-agent fidelity and action-conditioned prediction.

Open questions remain around scaling to diverse real-world environments, ensuring physical realism, and efficiently transferring learned world models to deployable policies.

Claimed Contributions

Genie Envisioner unified world foundation platform

The authors propose a unified platform that integrates robotic world generation and manipulation policy learning in a single video-generative framework, combining visual representation learning with action policy learning for robotic manipulation tasks.

10 retrieved papers

GE-Base instruction-conditioned video diffusion model

The authors introduce a large-scale video diffusion model that encodes spatial, temporal, and semantic structure of robotic interactions through multi-view egocentric video generation with cross-view consistency, trained on over 1 million manipulation episodes.

10 retrieved papers
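Contribution B combines instruction conditioning with multi-view egocentric generation and cross-view consistency. As a rough illustration of how those ingredients could fit in one denoising module, the sketch below lets views attend to each other per frame and injects the instruction FiLM-style; the module, shapes, and update rule are assumptions, not details taken from the paper.

```python
# Toy multi-view, instruction-conditioned denoiser (illustrative assumptions only).
import torch
import torch.nn as nn

class MultiViewDenoiser(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cross_view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.film = nn.Linear(dim, 2 * dim)       # instruction -> scale/shift
        self.out = nn.Linear(dim, dim)

    def forward(self, z, text_emb):
        # z: (frames, views, dim) noisy multi-view latents; text_emb: (dim,)
        h, _ = self.cross_view(z, z, z)           # each view attends to the others
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        h = h * (1 + scale) + shift               # FiLM-style instruction injection
        return self.out(h)                        # predicted noise, same shape as z

denoiser = MultiViewDenoiser()
z = torch.randn(16, 3, 128)                       # 16 frames, 3 egocentric views
text_emb = torch.randn(128)                       # pooled instruction embedding
for _ in range(10):                               # schematic denoising loop
    z = z - 0.1 * denoiser(z, text_emb)           # not an exact DDPM/flow update
print(z.shape)                                    # torch.Size([16, 3, 128])
```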
GE-Act parallel world action module with flow-matching decoder

The authors develop a lightweight parallel action module that is block-wise aligned with GE-Base and directly accesses multi-scale latent features to produce action trajectories, enabling real-time control and cross-embodiment generalization with minimal task-specific data.

10 retrieved papers · Can Refute (one refutable candidate identified)
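Contribution C's flow-matching decoder can be pictured with the standard rectified-flow recipe: regress a velocity field along straight noise-to-action paths, then Euler-integrate it at inference to emit an action chunk in a fixed, small number of steps, which is what makes such decoders fast enough for real-time control. The network, shapes, and hyperparameters below are illustrative assumptions, and GE-Act's block-wise alignment with GE-Base is reduced here to a single pooled context vector.

```python
# Minimal flow-matching sketch for an action decoder (rectified-flow style).
# Network, shapes, and hyperparameters are assumptions, not the authors' design.
import torch
import torch.nn as nn

H, A, D = 16, 7, 256                          # horizon, action dim, latent dim
net = nn.Sequential(nn.Linear(A + D + 1, 256), nn.SiLU(), nn.Linear(256, A))

def fm_loss(actions, latents):
    # actions: (B, H, A) expert chunks; latents: (B, D) pooled world-model features
    B = actions.shape[0]
    x0 = torch.randn_like(actions)            # noise endpoint of the path
    t = torch.rand(B, 1, 1)                   # random interpolation time
    xt = (1 - t) * x0 + t * actions           # point on the straight path
    target_v = actions - x0                   # constant target velocity
    ctx = latents[:, None, :].expand(B, H, D)
    inp = torch.cat([xt, ctx, t.expand(B, H, 1)], dim=-1)
    return ((net(inp) - target_v) ** 2).mean()

@torch.no_grad()
def sample(latents, steps=10):
    # Euler-integrate the learned velocity field from noise to an action chunk.
    B = latents.shape[0]
    x = torch.randn(B, H, A)
    ctx = latents[:, None, :].expand(B, H, D)
    for i in range(steps):
        t = torch.full((B, H, 1), i / steps)
        x = x + net(torch.cat([x, ctx, t], dim=-1)) / steps
    return x                                  # (B, H, A) executable trajectory

loss = fm_loss(torch.randn(4, H, A), torch.randn(4, D))
traj = sample(torch.randn(2, D))
print(loss.item(), traj.shape)                # scalar loss, torch.Size([2, 16, 7])
```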

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Genie Envisioner unified world foundation platform

The authors propose a unified platform that integrates robotic world generation and manipulation policy learning in a single video-generative framework, combining visual representation learning with action policy learning for robotic manipulation tasks.

Contribution

GE-Base instruction-conditioned video diffusion model

The authors introduce a large-scale video diffusion model that encodes spatial, temporal, and semantic structure of robotic interactions through multi-view egocentric video generation with cross-view consistency, trained on over 1 million manipulation episodes.

Contribution

GE-Act parallel world action module with flow-matching decoder

The authors develop a lightweight parallel action module that is block-wise aligned with GE-Base and directly accesses multi-scale latent features to produce action trajectories, enabling real-time control and cross-embodiment generalization with minimal task-specific data.