WoW!: World Models in a Closed-Loop World
Overview
Overall Novelty Assessment
The paper introduces World-In-World, a unified benchmarking platform for evaluating generative world models in closed-loop embodied settings, emphasizing task success over visual quality. It resides in the 'Unified Closed-Loop Benchmarking Platforms' leaf, which contains only two papers in total. This leaf sits within the broader 'Closed-Loop World Model Evaluation Frameworks and Benchmarks' branch, indicating a relatively sparse research direction focused on standardized, multi-domain evaluation protocols. The small population of this leaf suggests that comprehensive, heterogeneous-model benchmarking platforms remain underexplored compared to domain-specific simulators or open-loop prediction benchmarks.
The taxonomy reveals neighboring work in 'Domain-Specific Closed-Loop Simulation Environments' (four papers on specialized simulators like autonomous driving testbeds) and 'Open-Loop Prediction Benchmarks and Limitations' (one paper critiquing open-loop metrics). The paper's scope explicitly excludes domain-specific simulators and open-loop evaluation systems, positioning it as a general-purpose alternative. Related branches like 'Generative World Models for Embodied Planning and Control' (covering video diffusion and autoregressive models) and 'Closed-Loop Planning Architectures' (visuomotor control, multi-agent systems) address complementary aspects—model design and planning integration—but do not provide unified evaluation frameworks across heterogeneous models.
Among 27 candidates examined, no contribution was clearly refuted. For the 'World-In-World benchmark' contribution, 10 candidates were examined with zero refutable overlaps; for the 'unified closed-loop planning strategy', 10 candidates with none refutable; and for the 'empirical findings on visual quality and scaling', 7 candidates with none refutable. This suggests that, within the limited search scope, no prior work directly anticipates the combination of a standardized action API, a multi-environment closed-loop protocol, and a systematic comparison of diverse generative models. The empirical findings on controllability versus visual quality, together with the data scaling law, appear particularly novel given the absence of refuting candidates.
Based on the top-27 semantic matches and the taxonomy structure, the work addresses a recognized gap in unified, cross-model evaluation for embodied decision making. The sparse population of its taxonomy leaf and the absence of refuting candidates within the examined scope indicate substantive novelty. However, the analysis does not exhaustively cover the literature beyond these 27 candidates, and the field's rapid evolution means additional relevant work may exist outside this search window. The contribution appears most novel in its integrative benchmarking approach rather than in its individual technical components.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present World-In-World, a novel benchmark platform that evaluates generative world models in closed-loop embodied settings rather than through isolated visual quality metrics. It includes four diverse tasks (Active Recognition, Image-Goal Navigation, Active Embodied Question Answering, and Robotic Manipulation) that measure task success as the primary metric, emphasizing practical utility for embodied agents.
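To make the success-centric protocol concrete, the sketch below shows one way such a closed-loop evaluation loop could be organized. All names here (make_env, agent.act, the assumed env.step return signature, and the episode counts) are illustrative assumptions, not the benchmark's actual API; the point is only that scoring rests on final task success rather than on the visual quality of generated frames.

```python
# Illustrative sketch only: class and function names (make_env, agent.act,
# env.step return signature) are assumptions, not the World-In-World API.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

TASKS = [
    "active_recognition",
    "image_goal_navigation",
    "active_embodied_qa",
    "robotic_manipulation",
]

@dataclass
class EpisodeResult:
    task: str
    success: bool
    steps: int

def run_episode(env: Any, agent: Any, max_steps: int = 50) -> EpisodeResult:
    """Roll out one closed-loop episode: the agent acts, the environment returns
    the next observation, and only final task success is recorded."""
    obs = env.reset()
    for step in range(1, max_steps + 1):
        action = agent.act(obs)                 # the agent may consult its world model here
        obs, done, success = env.step(action)   # assumed step() return signature
        if done:
            return EpisodeResult(env.task, success, step)
    return EpisodeResult(env.task, False, max_steps)

def evaluate(agent: Any, make_env: Callable[[str], Any],
             episodes_per_task: int = 20) -> Dict[str, float]:
    """Report per-task success rate; visual quality of generated frames is
    deliberately not part of the score."""
    rates: Dict[str, float] = {}
    for task in TASKS:
        env = make_env(task)
        results: List[EpisodeResult] = [run_episode(env, agent)
                                        for _ in range(episodes_per_task)]
        rates[task] = sum(r.success for r in results) / len(results)
    return rates
```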
The authors develop a unified framework consisting of a closed-loop online planning strategy (proposal-simulation-revision cycle) and a standardized action API that transforms action sequences into control inputs (text prompts, camera trajectories, or low-level actions). This enables heterogeneous world models with different input modalities to be evaluated consistently within the same protocol.
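The following is a minimal sketch, under assumed interfaces, of how a proposal-simulation-revision cycle could sit on top of a standardized action API. The names ActionAPI, WorldModel.rollout, control_mode, propose, and score are invented for illustration and are not taken from the paper; the sketch only shows the idea that one generic action sequence is translated into whichever control format (text prompt, camera trajectory, or low-level commands) a given world model consumes, so heterogeneous models can be planned with under one protocol.

```python
# Hypothetical interfaces for a proposal-simulation-revision loop over a
# standardized action API; names and signatures are illustrative assumptions.
from typing import Any, Callable, List, Protocol

class WorldModel(Protocol):
    control_mode: str  # "text", "camera_trajectory", or "low_level"
    def rollout(self, observation: Any, control: Any) -> Any: ...

class ActionAPI:
    """Translate one generic action sequence into whatever control format a
    particular world model expects (illustrative mapping logic only)."""

    def to_control(self, actions: List[str], mode: str) -> Any:
        if mode == "text":
            return ", then ".join(actions)                        # text-prompted models
        if mode == "camera_trajectory":
            return [self._action_to_pose(a) for a in actions]     # camera-conditioned models
        if mode == "low_level":
            return [self._action_to_command(a) for a in actions]  # action-conditioned models
        raise ValueError(f"unknown control mode: {mode}")

    def _action_to_pose(self, action: str) -> Any:
        ...  # hypothetical: map a symbolic action to a camera pose delta

    def _action_to_command(self, action: str) -> Any:
        ...  # hypothetical: map a symbolic action to a low-level command

def plan_step(observation: Any, model: WorldModel, api: ActionAPI,
              propose: Callable[[Any], List[str]],
              score: Callable[[Any], float],
              n_proposals: int = 4) -> List[str]:
    """One proposal-simulation-revision cycle: propose candidate action sequences,
    imagine each outcome with the world model, and keep the highest-scoring plan."""
    best_actions, best_score = [], float("-inf")
    for _ in range(n_proposals):
        actions = propose(observation)                          # proposal
        control = api.to_control(actions, model.control_mode)   # unified action API
        imagined = model.rollout(observation, control)          # simulation
        value = score(imagined)                                 # revision signal
        if value > best_score:
            best_actions, best_score = actions, value
    return best_actions
```

In this sketch, allocating more inference-time compute corresponds to raising n_proposals or simulating longer horizons, which is the knob the inference-time-scaling finding below refers to.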
The authors present three key empirical findings: visual quality alone does not ensure task success (controllability matters more), scaling post-training with action-observation data is more effective than upgrading pretrained video generators, and allocating more inference-time compute via online planning substantially improves closed-loop performance. They also present the first data scaling law for world models in embodied settings.
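The report states the scaling-law finding without reproducing its functional form. Data scaling laws of this kind are commonly written as a saturating power law in the amount of post-training action-observation data; the form below is purely an illustrative assumption, with the symbols S_max, c, and alpha invented here rather than taken from the paper.

```latex
% Illustrative (assumed) form of a data scaling law: closed-loop success S
% as a saturating power law in post-training action-observation data D.
S(D) \;\approx\; S_{\max} \;-\; c\,D^{-\alpha}, \qquad c > 0,\ \alpha > 0
```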
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] World-in-world: World models in a closed-loop world
Contribution Analysis
Detailed comparisons for each claimed contribution
World-In-World benchmark for closed-loop evaluation of world models
The authors present World-In-World, a novel benchmark platform that evaluates generative world models in closed-loop embodied settings rather than through isolated visual quality metrics. It includes four diverse tasks (Active Recognition, Image-Goal Navigation, Active Embodied Question Answering, and Robotic Manipulation) that measure task success as the primary metric, emphasizing practical utility for embodied agents.
[3] Drivearena: A closed-loop generative simulation platform for autonomous driving
[4] End-to-end autonomous driving: Challenges and frontiers
[5] World-in-world: World models in a closed-loop world
[57] Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents
[58] Robobench: A comprehensive evaluation benchmark for multimodal large language models as embodied brain
[59] Embodiedeval: Evaluate multimodal llms as embodied agents
[60] Crab: Cross-platform agent benchmark for multi-modal embodied language model agents
[61] Embodied scene understanding for vision language models via metavqa
[62] Doe-1: Closed-loop autonomous driving with large world model
[63] On evaluation of embodied navigation agents
Unified closed-loop planning strategy with standardized action API
The authors develop a unified framework consisting of a closed-loop online planning strategy (proposal-simulation-revision cycle) and a standardized action API that transforms action sequences into control inputs (text prompts, camera trajectories, or low-level actions). This enables heterogeneous world models with different input modalities to be evaluated consistently within the same protocol.
[5] World-in-world: World models in a closed-loop world
[64] Multi-Hypothesis Task Planning: integrating temporal AI planning and semantic world modeling for AUV inspections in unknown environments
[65] Combined task and motion planning through an extensible planner-independent interface layer
[66] Algebras of actions in an agent's representations of the world
[67] Planning from Point Clouds over Continuous Actions for Multi-object Rearrangement
[68] Features, projections, and representation change for generalized planning
[69] AdaWorld: Learning Adaptable World Models with Latent Actions
[70] A Multiagent Planning Architecture
[71] Action-based Representation for Stochastic Optimization of Complex Real-World RVRP
[72] BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands
Empirical findings on visual quality, data scaling, and inference-time scaling
The authors present three key empirical findings: visual quality alone does not ensure task success (controllability matters more), scaling post-training with action-observation data is more effective than upgrading pretrained video generators, and allocating more inference-time compute via online planning substantially improves closed-loop performance. They also present the first data scaling law for world models in embodied settings.