WorldGym: World Model as An Environment for Policy Evaluation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: world model, video generation, policy evaluation, generative simulators
Abstract:

Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model that serves as a proxy for real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model correlate strongly with real-world success rates. Moreover, we show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints. Because it requires only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization ability on novel tasks and environments. We find that modern VLA-based robot policies still struggle to distinguish object shapes and can become distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WorldGym, an autoregressive video generation model designed to evaluate robot control policies through simulated rollouts. It resides in the 'Autoregressive Video Generation' leaf of the taxonomy, which contains five papers total. This leaf sits within the broader 'Video-Based World Models' branch, indicating a moderately populated research direction focused on pixel-space prediction. The taxonomy reveals that autoregressive approaches represent one of several competing paradigms for video-based world modeling, alongside diffusion-based methods and multi-view architectures, suggesting the paper operates in an active but not overcrowded subfield.

The taxonomy structure shows WorldGym's immediate neighbors include diffusion-based video generation methods and multi-view 3D-aware models, both exploring alternative architectures for visual prediction. The broader 'World Model Construction and Architecture' branch encompasses latent-space models and foundation model-based approaches, indicating diverse strategies beyond pixel-level autoregression. The 'Policy Evaluation and Testing' branch, though smaller with only two papers, directly aligns with WorldGym's core application. This positioning suggests the work bridges world model construction with policy assessment, a connection less emphasized in sibling papers focused primarily on model architecture.

Among the three contributions analyzed, the core WorldGym environment examined ten candidates with one appearing to provide overlapping prior work, while the single-model multi-task evaluation examined ten candidates with two potential refutations. The flexible diffusion horizon alignment contribution examined three candidates with none clearly refuting it. These statistics reflect a limited search scope of twenty-three total candidates, not an exhaustive literature review. The multi-task evaluation and core environment contributions show more substantial prior work overlap within this constrained search, while the alignment mechanism appears less directly addressed in examined papers.

Based on the limited search scope, WorldGym appears to occupy a recognizable position within autoregressive video modeling for robotics, with some contributions showing clearer connections to prior work than others. The analysis covers top-K semantic matches and does not claim comprehensive coverage of all relevant literature. The taxonomy context suggests the work contributes to an active research direction where multiple architectural paradigms compete, though the specific application to policy evaluation remains less densely explored than world model construction itself.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 3

Research Landscape Overview

Core task: world-model-based robot policy evaluation. The field centers on building predictive models of robotic environments and using them to assess or improve control policies without exhaustive real-world trials. The taxonomy organizes research into several main branches: World Model Construction and Architecture addresses how to learn or design these predictive models, often distinguishing video-based approaches (which generate visual rollouts) from other representations; Policy Learning with World Models explores training strategies that leverage imagined experience; Policy Evaluation and Testing focuses on benchmarking and validating learned policies in simulation or reality; Domain-Specific Applications targets particular robotic tasks such as manipulation or autonomous driving; Model-Based Control Strategies examines planning and optimization methods that exploit world models; and Comparative Studies and Surveys provide overviews of trade-offs across paradigms. Representative works like World Models Survey[5] and Control Strategies Review[8] synthesize these themes, while specialized efforts such as IRASim[13] and RoboScape[9] illustrate domain-tailored modeling.

A particularly active line of work involves autoregressive video generation for world modeling, where systems predict future visual observations frame-by-frame to enable policy rollout in pixel space. WorldGym[0] sits squarely in this branch, emphasizing scalable video-based prediction for policy evaluation. Nearby efforts include Heterogeneous Masked Autoregression[23], which explores alternative masking strategies for temporal prediction, and WorldEval[28], which benchmarks the fidelity of such video world models. In contrast, Diwa[3] and Gemini Robotics Policies[49] integrate large-scale vision-language priors to enhance generalization, highlighting a tension between purely visual autoregressive models and multimodal architectures.

Across these directions, key open questions revolve around balancing computational cost, sample efficiency, and the realism of generated rollouts, challenges that WorldGym[0] addresses by focusing on efficient autoregressive generation tailored to robotic policy evaluation workflows.

Claimed Contributions

WorldGym: world-model-based policy evaluation environment

The authors introduce WorldGym, a system that uses an autoregressive video generation model conditioned on robot actions to serve as a simulated environment for evaluating robot control policies. Policies are rolled out in this world model using Monte Carlo sampling, with a vision-language model providing task success rewards.

10 retrieved papers
Can Refute
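The evaluation loop described for this contribution can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function `estimate_success_rate` and all `toy_*` components are hypothetical names, a "frame" is reduced to a single float, the world model is replaced by a noisy 1-D integrator, and the vision-language reward is replaced by a simple distance threshold.

```python
import random
from typing import Callable, List

# Hypothetical stand-ins: a "frame" is one float here, not an image.
Frame = float
Action = float
Policy = Callable[[Frame, str], Action]
WorldModel = Callable[[List[Frame], Action], Frame]
RewardFn = Callable[[List[Frame], str], bool]


def estimate_success_rate(
    policy: Policy,
    world_model: WorldModel,
    reward_fn: RewardFn,
    initial_frame: Frame,
    instruction: str,
    num_rollouts: int = 100,
    max_steps: int = 20,
) -> float:
    """Monte Carlo estimate: fraction of imagined rollouts judged successful."""
    successes = 0
    for _ in range(num_rollouts):
        frames = [initial_frame]  # every rollout starts from the same frame
        for _ in range(max_steps):
            action = policy(frames[-1], instruction)
            # Autoregressive step: next frame conditioned on history and action.
            frames.append(world_model(frames, action))
        successes += int(reward_fn(frames, instruction))
    return successes / num_rollouts


# Toy components (assumptions for illustration only).
rng = random.Random(0)


def toy_world_model(history: List[Frame], action: Action) -> Frame:
    # Noisy integrator standing in for a stochastic video model.
    return history[-1] + action + rng.gauss(0.0, 0.05)


def toy_policy(frame: Frame, instruction: str) -> Action:
    # Move toward position 1.0 with a capped step size.
    return max(-0.2, min(0.2, 1.0 - frame))


def toy_reward(frames: List[Frame], instruction: str) -> bool:
    # Stand-in for a vision-language model judging task success.
    return abs(frames[-1] - 1.0) < 0.1


rate = estimate_success_rate(
    toy_policy, toy_world_model, toy_reward, 0.0, "reach the target"
)
print(f"estimated success rate: {rate:.2f}")
```

Because the world model is stochastic, the estimate is an average over sampled rollouts, and its variance shrinks as `num_rollouts` grows.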

Flexible diffusion horizon alignment for efficient policy rollouts

The authors propose a method to dynamically set the world model's prediction horizon to match each policy's action chunk size at inference time. This enables efficient video generation for policies with varying chunk sizes, utilizing hardware more effectively than fixed-horizon approaches.

3 retrieved papers
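To make the efficiency argument concrete, the sketch below counts world-model calls and generated frames for one rollout under an assumed cost model: each call produces exactly `horizon` frames, and frames beyond the policy's action chunk are discarded. The function `rollout_cost` and this cost model are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Optional, Tuple


def rollout_cost(
    total_steps: int, chunk_size: int, horizon: Optional[int] = None
) -> Tuple[int, int]:
    """Return (world_model_calls, generated_frames) for one rollout.

    horizon=None means the prediction horizon is aligned to the policy's
    action chunk size; otherwise the horizon is fixed in advance.
    Assumption: each call generates exactly `horizon` frames.
    """
    h = chunk_size if horizon is None else horizon
    num_chunks = math.ceil(total_steps / chunk_size)  # policy queries
    calls_per_chunk = math.ceil(chunk_size / h)       # calls to cover one chunk
    calls = num_chunks * calls_per_chunk
    return calls, calls * h


# A 40-step rollout for a policy that emits chunks of 4 actions.
print(rollout_cost(40, 4, horizon=1))  # (40, 40): one call per frame
print(rollout_cost(40, 4))             # (10, 40): horizon aligned to chunk
print(rollout_cost(40, 4, horizon=8))  # (10, 80): half the frames are wasted
```

Under this model, a horizon shorter than the chunk multiplies the number of calls, while a longer one generates frames that are never used; matching the two avoids both costs.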

Single world model for multi-task, multi-environment policy evaluation

The authors demonstrate that training a single world model on diverse robot data from multiple tasks and environments produces policy value estimates that strongly correlate with real-world success rates. This approach leverages the observation that while tasks and policies vary, the physical world follows consistent laws.

10 retrieved papers
Can Refute
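One way to check claims of this kind is to compare per-policy success rates from the world model against real-world rates using Pearson correlation (agreement in value) and Spearman rank correlation (agreement in ranking). The sketch below is a plain-Python illustration; the numbers are made-up placeholders, not results from the paper, and the report does not state which statistics the authors used.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder success rates for four hypothetical policies (not real data).
real_success = [0.2, 0.5, 0.7, 0.9]
world_model_success = [0.1, 0.4, 0.8, 0.85]

print("pearson :", round(pearson(real_success, world_model_success), 3))
print("spearman:", round(spearman(real_success, world_model_success), 3))
```

A Spearman value near 1 indicates that the world model preserves the relative ordering of policies even when the absolute success rates differ.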

