Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

ICLR 2026 Conference Submission
Anonymous Authors
world models, robotics, manipulation, model-based planning, imitation learning, video generation
Abstract:

Recent video generation models demonstrate a remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning, but they introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's rich priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and state values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with a higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score on challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function, and leverage model-based planning to achieve even higher success rates on challenging tasks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Cosmos Policy proposes a single-stage fine-tuning approach that adapts a large pretrained video generation model (Cosmos-Predict2) to generate robot actions as latent frames within a diffusion process. It resides in the 'Direct Video-to-Action Policy Learning' leaf, which contains only three papers including the original. This leaf sits within the broader 'Video Generation Models for Robot Policy Learning' branch, indicating a relatively focused but not overcrowded research direction. The taxonomy shows four sibling leaves under video generation (direct action, trajectory planning, world models, evaluation), suggesting this is an active area with multiple methodological variants.

The taxonomy reveals that direct video-to-action methods neighbor 'Video Generation for Trajectory Planning and Simulation' (four papers) and 'Video-Based World Models for Control' (four papers), which emphasize explicit trajectory synthesis or dynamics modeling rather than end-to-end action generation. Parallel branches like 'Vision-Language-Action Model Adaptation' (eleven papers across five leaves) and 'Diffusion Models for Robot Manipulation' represent alternative paradigms that integrate language grounding or iterative denoising frameworks. Cosmos Policy's approach diverges from VLA methods by avoiding language tokenization and from diffusion-based trajectory planners by embedding actions directly in video latents.

Among the twenty-five candidates examined across the three contributions, none clearly refutes the proposed approach. For the first contribution (single-stage fine-tuning), ten candidates were examined with zero refutable matches; for the second (latent frame injection), no overlapping prior work was found among ten candidates; for the third (unified joint training), five candidates were reviewed without refutation. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of single-stage adaptation, latent frame encoding, and joint policy, world-model, and value training appears less explored. However, the search scale (twenty-five papers) is modest relative to the fifty-paper taxonomy.

Given the restricted literature search and the sparse population of the immediate taxonomy leaf (three papers total), the analysis indicates potential novelty but cannot claim exhaustive coverage. The sibling papers in 'Direct Video-to-Action Policy Learning' likely represent the closest prior work, yet detailed comparison statistics are unavailable. The broader video generation branch (thirteen papers across four leaves) and neighboring VLA methods (eleven papers) provide context but do not directly overlap with the latent-action encoding scheme described.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: fine-tuning video models for robot visuomotor control and planning. The field has evolved into a rich landscape organized around how visual and temporal information is leveraged for robotic decision-making. At the highest level, one major branch focuses on video generation models that learn policies directly from predicted or generated visual sequences, treating video prediction as a planning substrate. Another prominent direction explores vision-language-action (VLA) model adaptation, where large pre-trained multimodal models are fine-tuned to map language instructions and visual observations to robot actions. Pre-training and transfer learning methods form a third pillar, emphasizing how representations learned from diverse data sources can be adapted to downstream manipulation tasks. Diffusion models have emerged as a distinct branch, applying iterative denoising frameworks to generate action trajectories or visual plans. Language model-based control, visual representation learning, and imitation learning from demonstrations each carve out their own methodological niches, while specialized applications target domains like surgery or navigation. Supporting all of these are benchmarks and datasets that provide standardized evaluation, and techniques for aligning generative models to human preferences or task-specific objectives.

Within the video generation branch, a particularly active line of work explores direct video-to-action policy learning, where models predict future visual states and extract control signals without explicit action labels during pre-training. Cosmos Policy[0] exemplifies this approach by fine-tuning a video generation model to produce both visual rollouts and corresponding actions, closely related to methods like Universal Text-Guided[5] and Dreamitate[50] that similarly leverage video prediction for planning.
In contrast, works such as VLM Predictive Control[4] and Geometry-aware 4D[3] emphasize integrating geometric reasoning or closed-loop feedback into video-based planning, highlighting a trade-off between open-loop generation and reactive control. Across the taxonomy, a recurring theme is the tension between leveraging large-scale pre-trained models—whether video generators, VLAs like Fine-tuning VLA[1], or diffusion-based planners surveyed in Diffusion Robotic Survey[2]—and the need for efficient adaptation to embodied settings with limited robot data. Cosmos Policy[0] sits squarely in the video generation lineage, sharing the philosophy of Universal Text-Guided[5] that visual prediction can serve as a powerful prior for control, while differing from VLA approaches that rely more heavily on language grounding and action tokenization.

Claimed Contributions

Cosmos Policy: single-stage fine-tuning approach for video-based robot policies

The authors propose Cosmos Policy, which adapts a pretrained video generation model into a robot policy via single-stage fine-tuning without architectural changes. This contrasts with prior works that require multiple training stages and new architectural components for action generation.

10 retrieved papers
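To make concrete what "single-stage fine-tuning without architectural changes" amounts to, here is a minimal numpy sketch of one denoising training step under a standard noise-prediction objective. The toy model and the simple noise schedule are illustrative stand-ins, not Cosmos-Predict2's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoising_step_loss(model, latents, t):
    """One training step of the ordinary video-diffusion objective.

    `latents` is the latent sequence the video backbone already consumes;
    in Cosmos Policy it additionally contains frames encoding actions and
    values, so no new loss term or output head is introduced.
    """
    noise = rng.standard_normal(latents.shape)
    noisy = np.sqrt(1.0 - t) * latents + np.sqrt(t) * noise  # toy variance-preserving mix
    pred = model(noisy, t)                                    # model predicts the added noise
    return float(np.mean((pred - noise) ** 2))

# Toy "model" that always predicts zero noise; its loss is just E[noise^2] ~= 1.
toy_model = lambda x, t: np.zeros_like(x)
latents = rng.standard_normal((8, 16))  # 8 latent frames, 16 dims each
loss = denoising_step_loss(toy_model, latents, t=0.5)
```

Because the objective is unchanged, fine-tuning here is literally continuing this same loss on demonstration-derived latent sequences.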
Latent frame injection mechanism for incorporating multiple modalities

The authors introduce latent frame injection, a mechanism that encodes robot actions, proprioception, state values, and multiple camera views as latent frames within the video model's native diffusion process. This enables the model to handle new modalities without architectural modifications.

10 retrieved papers
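The latent-frame-injection idea can be pictured as plain sequence packing: each extra modality is flattened, padded to the width of a video latent frame, and appended so the unmodified backbone denoises everything jointly. The function name and zero-pad layout below are hypothetical illustrations, not the paper's actual encoding.

```python
import numpy as np

def inject_latent_frames(video_latents, action_chunk, state_value, latent_dim):
    """Append non-visual modalities to a video latent sequence as extra frames."""
    def to_frame(x):
        flat = np.asarray(x, dtype=float).ravel()
        assert flat.size <= latent_dim, "modality must fit in one latent frame"
        frame = np.zeros(latent_dim)
        frame[: flat.size] = flat  # zero-pad to frame width
        return frame

    extra = np.stack([to_frame(action_chunk), to_frame([state_value])])
    return np.concatenate([video_latents, extra], axis=0)

video = np.zeros((4, 32))  # 4 video latent frames of width 32
seq = inject_latent_frames(video, action_chunk=np.ones((2, 7)), state_value=0.9, latent_dim=32)
# seq has 6 frames: 4 video + 1 action frame + 1 value frame
```

Because the packed sequence has the same shape conventions as ordinary video latents, the diffusion model can attend across modalities with no architectural modification.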
Unified joint training of policy, world model, and value function

The authors develop a unified training approach where a single model simultaneously learns to predict actions, future states, and state values through the video diffusion objective. This enables test-time planning via best-of-N sampling using predicted future states and values.

5 retrieved papers
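The best-of-N planning loop described above reduces to a few lines: sample N candidate trajectories, each carrying actions and a predicted value, and keep the highest-valued one. `sample_fn` below is a toy stand-in for a full diffusion sampling call, and the hand-written value is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def best_of_n_plan(sample_fn, n=8):
    """Draw n candidate plans and return the one with the highest predicted value."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=lambda c: c["value"])

# Toy sampler: random action chunk; this "value" simply prefers small actions.
def toy_sample():
    actions = rng.standard_normal(7)
    return {"actions": actions, "value": -float(np.linalg.norm(actions))}

best = best_of_n_plan(toy_sample, n=16)
```

In the actual method the value would come from the model's own predicted value frame (alongside predicted future states) rather than a hand-written score.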

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Cosmos Policy: single-stage fine-tuning approach for video-based robot policies

The authors propose Cosmos Policy, which adapts a pretrained video generation model into a robot policy via single-stage fine-tuning without architectural changes. This contrasts with prior works that require multiple training stages and new architectural components for action generation.

Contribution

Latent frame injection mechanism for incorporating multiple modalities

The authors introduce latent frame injection, a mechanism that encodes robot actions, proprioception, state values, and multiple camera views as latent frames within the video model's native diffusion process. This enables the model to handle new modalities without architectural modifications.

Contribution

Unified joint training of policy, world model, and value function

The authors develop a unified training approach where a single model simultaneously learns to predict actions, future states, and state values through the video diffusion objective. This enables test-time planning via best-of-N sampling using predicted future states and values.
