Ctrl-World: A Controllable Generative World Model for Robot Manipulation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: World Model, Vision-Language-Action Model (VLA)
Abstract:

Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their behavior on unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real-world rollouts, while systematic improvement demands additional corrective data with expert labels; both processes are slow, costly, and difficult to scale. World models offer a promising, scalable alternative by letting policies roll out in imagination space. A key challenge, however, is building a controllable world model that can handle multi-step interaction with generalist robot policies: the model must be compatible with modern generalist policies by supporting multi-view prediction, fine-grained action control, and consistent long-horizon interaction, a combination not achieved by prior work. In this paper, we take a step forward by introducing a controllable multi-view world model that can be used to evaluate and improve the instruction-following ability of generalist robot policies. Our model maintains long-horizon consistency with a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), our model generates spatially and temporally consistent trajectories under novel scenarios and new camera placements for over 20 seconds. We show that our method can accurately rank policy performance without real-world robot rollouts. Moreover, by synthesizing successful trajectories in imagination and using them for supervised fine-tuning, our approach improves policy success rates by 44.7%. Videos can be found at https://sites.google.com/view/ctrl-world.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Ctrl-World, a controllable multi-view world model designed to evaluate and improve generalist robot policies through imagination-based rollouts. It resides in the 'Evaluation and Controllability of World Models' leaf, which contains only two papers in the entire taxonomy of fifty works. This sparse population suggests the paper targets a relatively underexplored niche within the broader world model landscape, focusing specifically on rigorous evaluation and controllability rather than architectural innovation or training paradigms that dominate other branches.

The taxonomy reveals that most world model research clusters around architecture design (object-centric, geometric, multi-view representations), training methods (diffusion models, pretraining, data augmentation), and control frameworks (MPC, reinforcement learning, search-based planning). Ctrl-World's emphasis on multi-view prediction and pose-conditioned memory retrieval connects it to the 'Multi-View and View-Invariant Representations' leaf, yet its core contribution diverges by prioritizing downstream controllability and policy evaluation over representation learning alone. Neighboring branches like 'Model Predictive Control Frameworks' and 'Reinforcement Learning with World Models' address control but typically assume model quality rather than systematically evaluating it.

Among the twenty-nine candidates examined, the first contribution (controllable multi-view world model) met two refutable candidates among its ten retrieved papers, indicating that some prior work addresses similar multi-view controllability challenges. The second contribution (imagination-based policy evaluation) and third contribution (policy improvement via synthetic trajectories) were compared against nine and ten candidates respectively, with zero refutations found. This suggests that while the core world-model architecture has partial precedent within the limited search scope, the specific applications to policy evaluation and improvement appear less directly anticipated by the examined literature.

Given the limited search scope of twenty-nine candidates and the sparse two-paper leaf in which the work resides, the analysis captures a snapshot rather than exhaustive coverage. The controllable multi-view model shows some overlap with prior efforts, but the integration of evaluation and improvement workflows for generalist policies appears less saturated. The taxonomy structure indicates this evaluation-centric angle remains relatively underpopulated compared to architecture and training research directions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 2

Research Landscape Overview

Core task: Controllable world model for robot manipulation. The field organizes around several major branches that reflect different facets of building and deploying predictive models for robotic systems. World Model Architecture and Representation explores how to structure internal models, ranging from object-centric approaches like FOCUS[1] and Gaussian World Models[2] to particle-based representations such as ParticleFormer[20]. World Model Training and Learning Paradigms addresses how these models acquire knowledge, including pre-training strategies like those in Pre-training World Models[42] and masked prediction methods exemplified by Masked World Models[35]. Control and Planning with World Models focuses on leveraging learned dynamics for decision-making, with works like Deep Koopman MPC[34] and Hierarchical Task MPC[21] demonstrating model-predictive control frameworks. Vision-Language-Action Models and Foundation Models examines large-scale multimodal systems such as Goal-VLA[22] and RoboEngine[27], while Specialized World Model Applications and Domains targets specific settings like dexterous manipulation in DexSim2Real[26] or interactive digital twins in Interactive Digital Twins[28]. Model Learning and Identification Methods investigates system identification techniques, and Evaluation and Controllability of World Models scrutinizes how well these models support reliable control.

Recent work highlights tensions between generality and task-specific fidelity, with foundation models like RoboHorizon[10] and World4Omni[25] pursuing broad applicability, while specialized approaches such as IRASim[3] and FlowDreamer[16] emphasize domain-tailored accuracy. A key open question is how to balance model complexity with sample efficiency and real-world transferability, as seen in comparisons like Model-Based vs Free[29].
Ctrl-World[0] situates itself within the Evaluation and Controllability branch, emphasizing rigorous assessment of how world models enable precise manipulation control. This focus contrasts with purely architectural innovations like View-invariant World Models[4] or training paradigm shifts in MoDem-V2[7], instead prioritizing the downstream controllability and reliability that determine whether a learned model can safely guide real robotic actions. By concentrating on evaluation metrics and controllability guarantees, Ctrl-World[0] addresses a critical gap in ensuring that predictive models translate into robust manipulation performance.

Claimed Contributions

Ctrl-World: A controllable multi-view world model for robot manipulation

The authors propose a controllable world model that supports multi-view prediction, fine-grained action control via frame-level conditioning, and long-horizon consistency through pose-conditioned memory retrieval. This model enables generalist robot policies to perform rollouts in imagination space for both evaluation and improvement.
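The closed-loop interaction described above, in which a policy acts entirely inside a learned model, can be sketched as follows. All class and function names here are hypothetical, since this report does not expose Ctrl-World's actual interfaces; the toy model only mimics the structure of multi-view prediction with a growing memory of past frames.

```python
# Hypothetical sketch of a policy rolling out inside a learned world model.
# None of these interfaces come from the Ctrl-World paper; they only
# illustrate the closed-loop structure described in the contribution.

class ToyWorldModel:
    """Stands in for a multi-view predictive model with a memory of past frames."""
    def __init__(self, init_obs):
        # A pose-conditioned mechanism would retrieve from this memory;
        # here it is just an append-only list.
        self.memory = [init_obs]

    def step(self, action):
        # Frame-level action conditioning: each predicted multi-view frame
        # is generated given the action applied at that step.
        next_obs = {"views": [f"view_{i}@t{len(self.memory)}" for i in range(3)],
                    "action": action}
        self.memory.append(next_obs)
        return next_obs

def rollout(policy, world_model, horizon):
    """Roll a policy entirely in imagination for `horizon` steps."""
    trajectory = []
    obs = world_model.memory[-1]
    for _ in range(horizon):
        action = policy(obs)
        obs = world_model.step(action)
        trajectory.append((action, obs))
    return trajectory

traj = rollout(lambda obs: "move_arm",
               ToyWorldModel({"views": ["v0", "v1", "v2"]}),
               horizon=5)
print(len(traj))  # 5 imagined steps, no real robot involved
```

The point of the sketch is that the policy never touches hardware: every observation it conditions on is a prediction of the model, which is what makes both evaluation and data generation scalable.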

10 retrieved papers (Can Refute)
Imagination-based policy evaluation method

The authors demonstrate that their world model can accurately evaluate generalist robot policies by performing rollouts in imagination space, with evaluation results that align with real-world policy performance rankings without requiring actual robot rollouts.
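The evaluation claim is about rank agreement rather than absolute accuracy: imagined success rates only need to preserve the ordering of real-world success rates. A minimal illustration, with invented numbers (none of these values come from the paper):

```python
# Toy illustration of ranking policies by imagined success rate and
# checking that the ordering matches real-world results. All numbers
# are invented for illustration.

def rank_order(scores):
    """Return policy names sorted from best to worst success rate."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

# Hypothetical success rates: world-model rollouts vs. real robot rollouts.
imagined = {"policy_a": 0.62, "policy_b": 0.35, "policy_c": 0.48}
real     = {"policy_a": 0.55, "policy_b": 0.30, "policy_c": 0.41}

print(rank_order(imagined) == rank_order(real))  # True: orderings agree
```

Even if the model's absolute success estimates are biased, a practitioner can still use it to pick the best of several candidate policies as long as the ranking is preserved.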

9 retrieved papers
Policy improvement through synthetic trajectory generation

The authors show that their world model can generate synthetic successful trajectories entirely in imagination, which can then be used to fine-tune generalist policies via supervised learning, significantly improving their instruction-following capabilities on novel tasks.
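The improvement recipe amounts to a filter-then-fine-tune loop: generate many imagined rollouts, keep only those judged successful, and use their observation-action pairs as supervised fine-tuning data. A minimal sketch, with a hypothetical success criterion (the paper's actual scoring of imagined rollouts is not specified in this report):

```python
# Hypothetical sketch of building a supervised fine-tuning (SFT) set from
# imagined rollouts: keep only trajectories judged successful, then treat
# their (observation, action) pairs as SFT examples.

def build_sft_dataset(rollouts, is_success):
    """Collect (obs, action) pairs from imagined rollouts that succeed."""
    dataset = []
    for traj in rollouts:
        if is_success(traj):
            dataset.extend((obs, act) for obs, act in traj)
    return dataset

# Invented rollouts: each is a list of (observation, action) pairs, and
# "success" here is simply ending at a goal observation.
rollouts = [
    [("o0", "a0"), ("o1", "a1"), ("goal", "stop")],
    [("o0", "a0"), ("o2", "a2")],          # failed rollout, discarded
    [("o0", "a3"), ("goal", "stop")],
]
sft_data = build_sft_dataset(rollouts, lambda t: t[-1][0] == "goal")
print(len(sft_data))  # 5 pairs from the two successful rollouts (3 + 2)
```

The resulting pairs play the role of the "additional corrective data with expert labels" mentioned in the abstract, except that they are synthesized in imagination rather than collected on a real robot.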

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Ctrl-World: A controllable multi-view world model for robot manipulation


Contribution

Imagination-based policy evaluation method


Contribution

Policy improvement through synthetic trajectory generation
