WoW!: World Models in a Closed-Loop World

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: world models, video generation, embodied AI, generative models
Abstract:

Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce WoW!, the first open platform that benchmarks WMs in a closed-loop setting that mirrors real agent-environment interactions. WoW! provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs to be used for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritizing task success as the primary metric and moving beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success; controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance. By centering evaluation on closed-loop outcomes, WoW! establishes a new benchmark for the systematic assessment of WMs.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WoW!, a unified benchmarking platform for evaluating generative world models in closed-loop embodied settings, emphasizing task success over visual quality. It resides in the 'Unified Closed-Loop Benchmarking Platforms' leaf, which contains only two papers total. This leaf sits within the broader 'Closed-Loop World Model Evaluation Frameworks and Benchmarks' branch, indicating a relatively sparse research direction focused on standardized, multi-domain evaluation protocols. The small population of this leaf suggests that comprehensive, heterogeneous-model benchmarking platforms remain underexplored compared to domain-specific simulators or open-loop prediction benchmarks.

The taxonomy reveals neighboring work in 'Domain-Specific Closed-Loop Simulation Environments' (four papers on specialized simulators like autonomous driving testbeds) and 'Open-Loop Prediction Benchmarks and Limitations' (one paper critiquing open-loop metrics). The paper's scope explicitly excludes domain-specific simulators and open-loop evaluation systems, positioning it as a general-purpose alternative. Related branches like 'Generative World Models for Embodied Planning and Control' (covering video diffusion and autoregressive models) and 'Closed-Loop Planning Architectures' (visuomotor control, multi-agent systems) address complementary aspects—model design and planning integration—but do not provide unified evaluation frameworks across heterogeneous models.

Among 27 candidates examined, no contribution was clearly refuted. For the 'World-In-World benchmark', 10 candidates were examined with zero refutable overlaps; for the 'unified closed-loop planning strategy', 10 were examined with none refutable; for the 'empirical findings on visual quality and scaling', 7 were examined with none refutable. This suggests that within the limited search scope, no prior work directly anticipates the combination of a standardized action API, a multi-environment closed-loop protocol, and a systematic comparison of diverse generative models. The empirical findings on controllability versus visual quality and the data scaling law appear particularly novel given the absence of refuting candidates.

Based on the top-27 semantic matches and taxonomy structure, the work addresses a recognized gap in unified, cross-model evaluation for embodied decision making. The sparse population of its taxonomy leaf and the absence of refuting candidates within the examined scope indicate substantive novelty. However, the analysis does not cover exhaustive literature beyond these 27 candidates, and the field's rapid evolution means additional relevant work may exist outside this search window. The contribution appears most novel in its integrative benchmarking approach rather than in individual technical components.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: closed-loop evaluation of generative world models for embodied decision making. This field examines how learned or generative models of the environment can be validated through iterative interaction, where an agent's actions influence future observations and the model must adapt or plan accordingly.

The taxonomy reflects a rich structure spanning multiple perspectives: some branches focus on unified benchmarking platforms and evaluation frameworks that standardize closed-loop testing (e.g., WoW World Models[0], World in World[5], DriveArena[3]), while others emphasize the design of generative world models themselves for planning and control (e.g., Embodied World Models[16], GS World[33]). Additional branches address closed-loop planning architectures, LLM-based task planning with environmental feedback (e.g., Inner Monologue[15], PlanAgent[14]), model-based reinforcement learning that tightly couples learning and optimization, end-to-end autonomous driving systems (e.g., End to End Driving[4]), cognitive and agentic frameworks (e.g., Agentic Robot[7]), data-centric adaptive learning, embodied reasoning with predictive perception, and specialized or emerging application domains. Foundational concepts and cross-domain perspectives tie these threads together, highlighting shared principles across robotics, autonomous vehicles, and interactive simulation.

Several active lines of work reveal key trade-offs and open questions. One central tension is between open-loop model accuracy and closed-loop robustness: models that predict well in isolation may struggle when compounding errors arise from iterative replanning (Rethinking Closed Loop[6], Closing Loop Motion[1]). Another theme is the integration of large language models for high-level task decomposition versus low-level visuomotor control, where grounding symbolic plans in continuous feedback remains challenging (Grounding LLMs Planning[28], Learn by Interact[10]).
WoW World Models[0] sits within the unified benchmarking cluster, emphasizing standardized closed-loop evaluation protocols that can compare diverse generative models across embodied tasks. This positions it closely alongside World in World[5], which also provides a comprehensive testbed, and contrasts with more application-specific frameworks like DriveArena[3] that target autonomous driving scenarios. By offering a broad evaluation platform, WoW World Models[0] addresses the need for reproducible, multi-domain assessment of how well generative world models support real-time decision making under feedback.

Claimed Contributions

World-In-World benchmark for closed-loop evaluation of world models

The authors present World-In-World, a novel benchmark platform that evaluates generative world models in closed-loop embodied settings rather than through isolated visual quality metrics. It includes four diverse tasks (Active Recognition, Image-Goal Navigation, Active Embodied Question Answering, and Robotic Manipulation) that measure task success as the primary metric, emphasizing practical utility for embodied agents.

10 retrieved papers
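Treating task success as the primary metric implies an evaluation harness that runs the agent-environment loop to completion and scores only the final outcome, rather than rating intermediate frames. The sketch below is a minimal, hypothetical harness in that spirit; `ToyGoalEnv`, `GreedyAgent`, and the `step` return signature are illustrative assumptions, not the benchmark's actual API:

```python
# Hypothetical closed-loop evaluation harness: the agent acts, the
# environment responds, and only final task success is scored.

def evaluate_closed_loop(agent, env, episodes=10, max_steps=50):
    """Return the fraction of episodes that end in task success."""
    successes = 0
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(max_steps):
            action = agent.act(obs)  # may consult a world model internally
            obs, done, success = env.step(action)
            if done:
                successes += success
                break
    return successes / episodes


class ToyGoalEnv:
    """Stand-in environment: the task succeeds on reaching position 3."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += action
        done = self.pos >= 3
        return self.pos, done, 1 if done else 0


class GreedyAgent:
    """Stand-in agent that always steps toward the goal."""
    def act(self, obs):
        return 1
```

A real harness would swap in one of the four benchmark tasks and an agent whose `act` consults a world model; the success-rate accounting stays the same.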
Unified closed-loop planning strategy with unified action API

The authors develop a unified framework consisting of a closed-loop online planning strategy (proposal-simulation-revision cycle) and a standardized action API that transforms action sequences into control inputs (text prompts, camera trajectories, or low-level actions). This enables heterogeneous world models with different input modalities to be evaluated consistently within the same protocol.

10 retrieved papers
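The standardized action API described above can be pictured as an adapter layer: each world model declares which control modality it consumes, and a shared translator maps abstract action sequences into that modality before the proposal-simulation-revision cycle runs. The following is a speculative sketch; the adapter names, action-dictionary fields, and `plan_step` signature are illustrative assumptions, not the platform's actual interface:

```python
# Hypothetical adapter layer mapping one abstract action sequence into
# the three control modalities mentioned in the contribution.

def to_text_prompt(actions):
    """Render an action sequence as a prompt for a text-conditioned WM."""
    return " then ".join(a["name"] for a in actions)

def to_camera_trajectory(actions):
    """Accumulate per-action displacements into absolute camera positions."""
    poses, pos = [], (0.0, 0.0)
    for a in actions:
        dx, dy = a.get("delta", (0.0, 0.0))
        pos = (pos[0] + dx, pos[1] + dy)
        poses.append(pos)
    return poses

def to_low_level(actions):
    """Pass raw per-action control deltas through unchanged."""
    return [a.get("delta", (0.0, 0.0)) for a in actions]

ADAPTERS = {"text": to_text_prompt,
            "camera": to_camera_trajectory,
            "low_level": to_low_level}

def encode_actions(actions, modality):
    """Translate an abstract action sequence into a model's input modality."""
    return ADAPTERS[modality](actions)

def plan_step(world_model, propose, score, candidates=4):
    """Proposal-simulation-revision: simulate candidates, keep the best."""
    best, best_score = None, float("-inf")
    for _ in range(candidates):
        seq = propose()             # proposal
        rollout = world_model(seq)  # simulation with the world model
        s = score(rollout)          # evaluate the imagined outcome
        if s > best_score:          # revision: keep the better plan
            best, best_score = seq, s
    return best
```

With an adapter layer of this shape, a text-conditioned video model and a low-level-action model can both sit behind the same `plan_step` without changing the evaluation protocol.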
Empirical findings on visual quality, data scaling, and inference-time scaling

The authors present three key empirical findings: visual quality alone does not ensure task success (controllability matters more), scaling post-training with action-observation data is more effective than upgrading pretrained video generators, and allocating more inference-time compute via online planning substantially improves closed-loop performance. They also present the first data scaling law for world models in embodied settings.

7 retrieved papers
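A data scaling law of this kind is typically estimated by fitting a power law, for example error ≈ a * N^(-b), to performance measured at several post-training dataset sizes; on log-log axes the fit reduces to ordinary linear regression. The self-contained sketch below uses synthetic numbers, not the paper's measurements:

```python
import math

def fit_power_law(sizes, errors):
    """Fit error = a * N**(-b) by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    return a, -slope  # b = -slope for a decaying power law

# Synthetic check: data generated from error = 2 * N**-0.5
sizes = [1_000, 10_000, 100_000, 1_000_000]
errors = [2 * n ** -0.5 for n in sizes]
a, b = fit_power_law(sizes, errors)
```

On noiseless synthetic data the fit recovers a = 2 and b = 0.5 exactly; with real closed-loop success measurements one would fit the same form to observed error rates at each post-training data budget.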

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: World-In-World benchmark for closed-loop evaluation of world models

Contribution: Unified closed-loop planning strategy with unified action API

Contribution: Empirical findings on visual quality, data scaling, and inference-time scaling