WorldGym: World Model as An Environment for Policy Evaluation

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.5

Keywords: World model, video generation, policy evaluation, generative simulators

Abstract:

Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model which serves as a proxy to real world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model highly correlate with real-world success rates. Moreover, we show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints. Due to requiring only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization ability on novel tasks and environments. We find that modern VLA-based robot policies still struggle to distinguish object shapes and can become distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WorldGym, an autoregressive video generation model designed to evaluate robot control policies through simulated rollouts. It resides in the 'Autoregressive Video Generation' leaf of the taxonomy, which contains five papers total. This leaf sits within the broader 'Video-Based World Models' branch, indicating a moderately populated research direction focused on pixel-space prediction. The taxonomy reveals that autoregressive approaches represent one of several competing paradigms for video-based world modeling, alongside diffusion-based methods and multi-view architectures, suggesting the paper operates in an active but not overcrowded subfield.

The taxonomy structure shows WorldGym's immediate neighbors include diffusion-based video generation methods and multi-view 3D-aware models, both exploring alternative architectures for visual prediction. The broader 'World Model Construction and Architecture' branch encompasses latent-space models and foundation model-based approaches, indicating diverse strategies beyond pixel-level autoregression. The 'Policy Evaluation and Testing' branch, though smaller with only two papers, directly aligns with WorldGym's core application. This positioning suggests the work bridges world model construction with policy assessment, a connection less emphasized in sibling papers focused primarily on model architecture.

Three contributions were analyzed. For the core WorldGym environment, ten candidates were examined and one appears to provide overlapping prior work. For the single-model multi-task evaluation, ten candidates were examined and two were flagged as potential refutations. For the flexible diffusion horizon alignment, three candidates were examined and none clearly refutes it. These statistics reflect a limited search scope of twenty-three candidates in total, not an exhaustive literature review. Within this constrained search, the multi-task evaluation and core environment contributions show more substantial overlap with prior work, while the alignment mechanism appears less directly addressed in the examined papers.

Based on the limited search scope, WorldGym appears to occupy a recognizable position within autoregressive video modeling for robotics, with some contributions showing clearer connections to prior work than others. The analysis covers top-K semantic matches and does not claim comprehensive coverage of all relevant literature. The taxonomy context suggests the work contributes to an active research direction where multiple architectural paradigms compete, though the specific application to policy evaluation remains less densely explored than world model construction itself.

Taxonomy

[Interactive taxonomy figure not reproduced. Legend: core-task taxonomy papers, claimed contributions, contribution candidate papers compared, refutable paper.]

Research Landscape Overview

Core task: world model based robot policy evaluation. The field centers on building predictive models of robotic environments and using them to assess or improve control policies without exhaustive real-world trials. The taxonomy organizes research into several main branches: World Model Construction and Architecture addresses how to learn or design these predictive models, often distinguishing video-based approaches (which generate visual rollouts) from other representations; Policy Learning with World Models explores training strategies that leverage imagined experience; Policy Evaluation and Testing focuses on benchmarking and validating learned policies in simulation or reality; Domain-Specific Applications targets particular robotic tasks such as manipulation or autonomous driving; Model-Based Control Strategies examines planning and optimization methods that exploit world models; and Comparative Studies and Surveys provide overviews of trade-offs across paradigms. Representative works like World Models Survey[5] and Control Strategies Review[8] synthesize these themes, while specialized efforts such as IRASim[13] and RoboScape[9] illustrate domain-tailored modeling.

A particularly active line of work involves autoregressive video generation for world modeling, where systems predict future visual observations frame-by-frame to enable policy rollout in pixel space. WorldGym[0] sits squarely in this branch, emphasizing scalable video-based prediction for policy evaluation. Nearby efforts include Heterogeneous Masked Autoregression[23], which explores alternative masking strategies for temporal prediction, and WorldEval[28], which benchmarks the fidelity of such video world models. In contrast, Diwa[3] and Gemini Robotics Policies[49] integrate large-scale vision-language priors to enhance generalization, highlighting a tension between purely visual autoregressive models and multimodal architectures.

Across these directions, key open questions revolve around balancing computational cost, sample efficiency, and the realism of generated rollouts, challenges that WorldGym[0] addresses by focusing on efficient autoregressive generation tailored to robotic policy evaluation workflows.

Claimed Contributions

WorldGym: world-model-based policy evaluation environment

Can Refute

10 retrieved papers

The authors introduce WorldGym, a system that uses an autoregressive video generation model conditioned on robot actions to serve as a simulated environment for evaluating robot control policies. Policies are rolled out in this world model using Monte Carlo sampling, with a vision-language model providing task success rewards.
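The evaluation loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `evaluate_policy`, `toy_world_model`, `toy_reward`, `reaching_policy`, and `idle_policy` are hypothetical, a frame is reduced to a single float, and the toy reward function stands in for the vision-language model that scores a rollout.

```python
import random

def evaluate_policy(policy, world_model, reward_fn, initial_frame,
                    n_rollouts=100, max_steps=20, seed=0):
    """Monte Carlo estimate of a policy's success rate inside a world model.

    All four callables are hypothetical stand-ins:
      policy(frame) -> action
      world_model(frame, action, rng) -> next frame (stochastic)
      reward_fn(frames) -> 1.0 for success, 0.0 otherwise
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total_reward = 0.0
    for _ in range(n_rollouts):
        frames = [initial_frame]  # each rollout starts from the same frame
        for _ in range(max_steps):
            action = policy(frames[-1])
            frames.append(world_model(frames[-1], action, rng))
        total_reward += reward_fn(frames)
    return total_reward / n_rollouts

# Toy stand-ins: a "frame" is just a 1-D gripper position.
def toy_world_model(frame, action, rng):
    return frame + action + rng.gauss(0.0, 0.05)  # noisy dynamics

def toy_reward(frames):
    return 1.0 if abs(frames[-1] - 1.0) < 0.2 else 0.0  # reached target?

def reaching_policy(frame):
    return 0.5 * (1.0 - frame)  # move halfway toward the target at 1.0

def idle_policy(frame):
    return 0.0  # never moves

good = evaluate_policy(reaching_policy, toy_world_model, toy_reward, 0.0)
bad = evaluate_policy(idle_policy, toy_world_model, toy_reward, 0.0)
print("reaching policy:", good, "idle policy:", bad)
```

In this sketch the success rate is simply the mean reward over independent rollouts that all start from one initial frame, which mirrors the report's description of evaluation from a single start frame.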

Flexible diffusion horizon alignment for efficient policy rollouts

3 retrieved papers

The authors propose a method to dynamically set the world model's prediction horizon to match each policy's action chunk size at inference time. This enables efficient video generation for policies with varying chunk sizes, utilizing hardware more effectively than fixed-horizon approaches.
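A toy cost model may help show why matching the prediction horizon to the chunk size could save compute. The sketch below is an assumption-laden illustration, not the paper's method: the function `rollout_cost` is hypothetical, and it assumes that the policy replans after every action chunk, that each world-model call generates exactly `horizon` frames, and that frames generated beyond the current chunk are discarded.

```python
import math

def rollout_cost(total_steps, chunk_size, horizon):
    """Count world-model calls and frames for one rollout (toy accounting).

    Assumptions (not taken from the paper): the policy replans after each
    chunk of `chunk_size` actions, every world-model call produces exactly
    `horizon` frames, and frames past the end of the chunk are thrown away.
    """
    n_chunks = math.ceil(total_steps / chunk_size)
    calls_per_chunk = math.ceil(chunk_size / horizon)
    calls = n_chunks * calls_per_chunk
    generated = calls * horizon
    used = n_chunks * chunk_size
    return {"calls": calls,
            "generated_frames": generated,
            "wasted_frames": generated - used}

# A policy that emits 4 actions per chunk, rolled out for 40 steps.
fixed = rollout_cost(total_steps=40, chunk_size=4, horizon=16)   # fixed horizon
aligned = rollout_cost(total_steps=40, chunk_size=4, horizon=4)  # horizon = chunk
print("fixed horizon:  ", fixed)
print("aligned horizon:", aligned)
```

Under these assumptions, a horizon longer than the chunk wastes generated frames, while a horizon shorter than the chunk needs several calls per chunk; setting the two equal avoids both.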

Single world model for multi-task, multi-environment policy evaluation

Can Refute

10 retrieved papers

The authors demonstrate that training a single world model on diverse robot data from multiple tasks and environments produces policy value estimates that strongly correlate with real-world success rates. This approach leverages the observation that while tasks and policies vary, the physical world follows consistent laws.
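One way to quantify "strongly correlate" and "preserve relative rankings" is with a Pearson correlation and a pairwise rank-agreement score, sketched below. The helper names `pearson` and `pairwise_rank_agreement` are hypothetical, and the success rates are made-up placeholder values for illustration only; they are not results from the paper.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def pairwise_rank_agreement(xs, ys):
    """Fraction of policy pairs ordered the same way by both evaluations.

    Ties in either sequence are counted as disagreements.
    """
    pairs = list(combinations(range(len(xs)), 2))
    agree = sum(1 for i, j in pairs if (xs[i] - xs[j]) * (ys[i] - ys[j]) > 0)
    return agree / len(pairs)

# Placeholder success rates for four hypothetical policies (NOT paper data).
world_model_success = [0.20, 0.50, 0.70, 0.90]
real_world_success = [0.10, 0.40, 0.80, 0.85]

print("Pearson r:", pearson(world_model_success, real_world_success))
print("Rank agreement:",
      pairwise_rank_agreement(world_model_success, real_world_success))
```

A rank agreement of 1.0 means the world model orders every pair of policies the same way as real-world testing, even if the absolute success rates differ.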

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[1] Evaluating Robot Policies in a World Model

J Quevedo, P Liang, S Yang (2025)

[23] Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression

Lirui Wang, Kevin Zhao, Chaoqi Liu, Xinlei Chen (2025)

[28] WorldEval: World Model as Real-World Robot Policies Evaluator

Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, Yi Xu (2025)

[49] Evaluating Gemini Robotics Policies in a Veo World Simulator

Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar, Andrew Marmon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, Allan Zhou (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

WorldGym: world-model-based policy evaluation environment

[36] Scalable policy evaluation with video world models (Can Refute)

[1] Evaluating Robot Policies in a World Model (Cannot Refute)

[23] Learning Real-World Action-Video Dynamics with Heterogeneous Masked Autoregression (Cannot Refute)

[51] Learning World Models for Interactive Video Generation (Cannot Refute)

[52] DrivingGPT: Unifying driving world modeling and planning with multi-modal autoregressive transformers (Cannot Refute)

[53] Vid2World: Crafting Video Diffusion Models to Interactive World Models (Cannot Refute)

[54] Doe-1: Closed-loop autonomous driving with large world model (Cannot Refute)

[55] Pre-trained video generative models as world simulators (Cannot Refute)

[56] iVideoGPT: Interactive VideoGPTs are scalable world models (Cannot Refute)

[57] CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion (Cannot Refute)

Contribution

Flexible diffusion horizon alignment for efficient policy rollouts

[65] Multimodal diffusion transformer: Learning versatile behavior from multimodal goals (Cannot Refute)

[66] Mixture of Horizons in Action Chunking (Cannot Refute)

[67] Adaptive planning hierarchical diffuser for multi-step action execution in offline reinforcement learning (Cannot Refute)

Contribution

Single world model for multi-task, multi-environment policy evaluation

[58] Trajectory World Models for Heterogeneous Environments (Can Refute)

[59] Mastering diverse domains through world models (Can Refute)

[1] Evaluating Robot Policies in a World Model (Cannot Refute)

[5] A step toward world models: A survey on robotic manipulation (Cannot Refute)

[15] Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets (Cannot Refute)

[60] Learning interactive real-world simulators (Cannot Refute)

[61] Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation (Cannot Refute)

[62] Recurrent world models facilitate policy evolution (Cannot Refute)

[63] WorldGPT: Empowering LLM as Multimodal World Model (Cannot Refute)

[64] One Policy but Many Worlds: A Scalable Unified Policy for Versatile Humanoid Locomotion (Cannot Refute)
