Ctrl-World: A Controllable Generative World Model for Robot Manipulation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: World Model, Vision-Language-Action Model (VLA)
Abstract:

Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their behavior on unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real-world rollouts, while systematic improvement demands additional corrective data with expert labels; both processes are slow, costly, and difficult to scale. World models offer a promising, scalable alternative by letting policies roll out in imagination space. A key challenge, however, is building a controllable world model that can handle multi-step interaction with generalist robot policies: the model must be compatible with modern generalist policies by supporting multi-view prediction, fine-grained action control, and consistent long-horizon interaction, a combination not achieved by prior work. In this paper, we take a step forward by introducing a controllable multi-view world model that can be used to evaluate and improve the instruction-following ability of generalist robot policies. Our model maintains long-horizon consistency with a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), our model generates spatially and temporally consistent trajectories under novel scenarios and new camera placements for over 20 seconds. We show that our method can accurately rank policy performance without real-world robot rollouts. Moreover, by synthesizing successful trajectories in imagination and using them for supervised fine-tuning, our approach improves policy success rates by 44.7%. Videos can be found at https://sites.google.com/view/ctrl-world.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Ctrl-World, a controllable multi-view world model designed to evaluate and improve generalist robot policies through imagination-based rollouts. It resides in the 'Evaluation and Controllability of World Models' leaf, which contains only two papers in the entire taxonomy of fifty works. This sparse population suggests the paper targets a relatively underexplored niche within the broader world model landscape, focusing specifically on rigorous evaluation and controllability rather than architectural innovation or training paradigms that dominate other branches.

The taxonomy reveals that most world model research clusters around architecture design (object-centric, geometric, multi-view representations), training methods (diffusion models, pretraining, data augmentation), and control frameworks (MPC, reinforcement learning, search-based planning). Ctrl-World's emphasis on multi-view prediction and pose-conditioned memory retrieval connects it to the 'Multi-View and View-Invariant Representations' leaf, yet its core contribution diverges by prioritizing downstream controllability and policy evaluation over representation learning alone. Neighboring branches like 'Model Predictive Control Frameworks' and 'Reinforcement Learning with World Models' address control but typically assume model quality rather than systematically evaluating it.

Among the twenty-nine candidates examined, the first contribution (controllable multi-view world model) met two refutable candidates among its ten retrieved papers, indicating that some prior work addresses similar multi-view controllability challenges. The second contribution (imagination-based policy evaluation) and third contribution (policy improvement via synthetic trajectories) were compared against nine and ten candidates respectively, with zero refutations found. This suggests that while the core world-model architecture has partial precedent within the limited search scope, the specific applications to policy evaluation and improvement appear less directly anticipated by the examined literature.

Given the limited search scope of twenty-nine candidates and the sparse two-paper leaf in which the work resides, the analysis captures a snapshot rather than exhaustive coverage. The controllable multi-view model shows some overlap with prior efforts, but the integration of evaluation and improvement workflows for generalist policies appears less saturated. The taxonomy structure indicates this evaluation-centric angle remains relatively underpopulated compared to architecture and training research directions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 2

Research Landscape Overview

Core task: Controllable world model for robot manipulation. The field organizes around several major branches that reflect different facets of building and deploying predictive models for robotic systems. World Model Architecture and Representation explores how to structure internal models, ranging from object-centric approaches like FOCUS[1] and Gaussian World Models[2] to particle-based representations such as ParticleFormer[20]. World Model Training and Learning Paradigms addresses how these models acquire knowledge, including pre-training strategies like those in Pre-training World Models[42] and masked prediction methods exemplified by Masked World Models[35]. Control and Planning with World Models focuses on leveraging learned dynamics for decision-making, with works like Deep Koopman MPC[34] and Hierarchical Task MPC[21] demonstrating model-predictive control frameworks. Vision-Language-Action Models and Foundation Models examines large-scale multimodal systems such as Goal-VLA[22] and RoboEngine[27], while Specialized World Model Applications and Domains targets specific settings like dexterous manipulation in DexSim2Real[26] or interactive digital twins in Interactive Digital Twins[28]. Model Learning and Identification Methods investigates system identification techniques, and Evaluation and Controllability of World Models scrutinizes how well these models support reliable control.

Recent work highlights tensions between generality and task-specific fidelity, with foundation models like RoboHorizon[10] and World4Omni[25] pursuing broad applicability, while specialized approaches such as IRASim[3] and FlowDreamer[16] emphasize domain-tailored accuracy. A key open question is how to balance model complexity with sample efficiency and real-world transferability, as seen in comparisons like Model-Based vs Free[29].
Ctrl-World[0] situates itself within the Evaluation and Controllability branch, emphasizing rigorous assessment of how world models enable precise manipulation control. This focus contrasts with purely architectural innovations like View-invariant World Models[4] or training paradigm shifts in MoDem-V2[7], instead prioritizing the downstream controllability and reliability that determine whether a learned model can safely guide real robotic actions. By concentrating on evaluation metrics and controllability guarantees, Ctrl-World[0] addresses a critical gap in ensuring that predictive models translate into robust manipulation performance.

Claimed Contributions

Ctrl-World: A controllable multi-view world model for robot manipulation

The authors propose a controllable world model that supports multi-view prediction, fine-grained action control via frame-level conditioning, and long-horizon consistency through pose-conditioned memory retrieval. This model enables generalist robot policies to perform rollouts in imagination space for both evaluation and improvement.
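The closed-loop interaction described above, in which a policy acts entirely inside a learned model, can be sketched as follows. All class and function names here are hypothetical, since this report does not expose Ctrl-World's actual interfaces; the toy model only mimics the structure of multi-view prediction with a growing memory of past frames.

```python
# Hypothetical sketch of a policy rolling out inside a learned world model.
# None of these interfaces come from the Ctrl-World paper; they only
# illustrate the closed-loop structure described in the contribution.

class ToyWorldModel:
    """Stands in for a multi-view predictive model with a memory of past frames."""
    def __init__(self, init_obs):
        # A pose-conditioned mechanism would retrieve from this memory;
        # here it is just an append-only list.
        self.memory = [init_obs]

    def step(self, action):
        # Frame-level action conditioning: each predicted multi-view frame
        # is generated given the action applied at that step.
        next_obs = {"views": [f"view_{i}@t{len(self.memory)}" for i in range(3)],
                    "action": action}
        self.memory.append(next_obs)
        return next_obs

def rollout(policy, world_model, horizon):
    """Roll a policy entirely in imagination for `horizon` steps."""
    trajectory = []
    obs = world_model.memory[-1]
    for _ in range(horizon):
        action = policy(obs)
        obs = world_model.step(action)
        trajectory.append((action, obs))
    return trajectory

traj = rollout(lambda obs: "move_arm",
               ToyWorldModel({"views": ["v0", "v1", "v2"]}),
               horizon=5)
print(len(traj))  # 5 imagined steps, no real robot involved
```

The point of the sketch is that the policy never touches hardware: every observation it conditions on is a prediction of the model, which is what makes both evaluation and data generation scalable.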

10 retrieved papers (Can Refute)
Imagination-based policy evaluation method

The authors demonstrate that their world model can accurately evaluate generalist robot policies by performing rollouts in imagination space, with evaluation results that align with real-world policy performance rankings without requiring actual robot rollouts.
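The evaluation claim is about rank agreement rather than absolute accuracy: imagined success rates only need to preserve the ordering of real-world success rates. A minimal illustration, with invented numbers (none of these values come from the paper):

```python
# Toy illustration of ranking policies by imagined success rate and
# checking that the ordering matches real-world results. All numbers
# are invented for illustration.

def rank_order(scores):
    """Return policy names sorted from best to worst success rate."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

# Hypothetical success rates: world-model rollouts vs. real robot rollouts.
imagined = {"policy_a": 0.62, "policy_b": 0.35, "policy_c": 0.48}
real     = {"policy_a": 0.55, "policy_b": 0.30, "policy_c": 0.41}

print(rank_order(imagined) == rank_order(real))  # True: orderings agree
```

Even if the model's absolute success estimates are biased, a practitioner can still use it to pick the best of several candidate policies as long as the ranking is preserved.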

9 retrieved papers
Policy improvement through synthetic trajectory generation

The authors show that their world model can generate synthetic successful trajectories entirely in imagination, which can then be used to fine-tune generalist policies via supervised learning, significantly improving their instruction-following capabilities on novel tasks.
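The improvement recipe amounts to a filter-then-fine-tune loop: generate many imagined rollouts, keep only those judged successful, and use their observation-action pairs as supervised fine-tuning data. A minimal sketch, with a hypothetical success criterion (the paper's actual scoring of imagined rollouts is not specified in this report):

```python
# Hypothetical sketch of building a supervised fine-tuning (SFT) set from
# imagined rollouts: keep only trajectories judged successful, then treat
# their (observation, action) pairs as SFT examples.

def build_sft_dataset(rollouts, is_success):
    """Collect (obs, action) pairs from imagined rollouts that succeed."""
    dataset = []
    for traj in rollouts:
        if is_success(traj):
            dataset.extend((obs, act) for obs, act in traj)
    return dataset

# Invented rollouts: each is a list of (observation, action) pairs, and
# "success" here is simply ending at a goal observation.
rollouts = [
    [("o0", "a0"), ("o1", "a1"), ("goal", "stop")],
    [("o0", "a0"), ("o2", "a2")],          # failed rollout, discarded
    [("o0", "a3"), ("goal", "stop")],
]
sft_data = build_sft_dataset(rollouts, lambda t: t[-1][0] == "goal")
print(len(sft_data))  # 5 pairs from the two successful rollouts (3 + 2)
```

The resulting pairs play the role of the "additional corrective data with expert labels" mentioned in the abstract, except that they are synthesized in imagination rather than collected on a real robot.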

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Ctrl-World: A controllable multi-view world model for robot manipulation


Contribution

Imagination-based policy evaluation method


Contribution

Policy improvement through synthetic trajectory generation
