Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

ICLR 2026 Conference Submission
Anonymous Authors
world models, robotics, manipulation, model-based planning, imitation learning, video generation
Abstract:

Recent video generation models demonstrate a remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning, but they introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's rich priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and state values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with a higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score on challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function, and leverage model-based planning to achieve even higher success rates on challenging tasks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Cosmos Policy proposes a single-stage fine-tuning approach that adapts a large pretrained video generation model (Cosmos-Predict2) to generate robot actions as latent frames within a diffusion process. It resides in the 'Direct Video-to-Action Policy Learning' leaf, which contains only three papers including the original. This leaf sits within the broader 'Video Generation Models for Robot Policy Learning' branch, indicating a relatively focused but not overcrowded research direction. The taxonomy shows four sibling leaves under video generation (direct action, trajectory planning, world models, evaluation), suggesting this is an active area with multiple methodological variants.

The taxonomy reveals that direct video-to-action methods neighbor 'Video Generation for Trajectory Planning and Simulation' (four papers) and 'Video-Based World Models for Control' (four papers), which emphasize explicit trajectory synthesis or dynamics modeling rather than end-to-end action generation. Parallel branches like 'Vision-Language-Action Model Adaptation' (eleven papers across five leaves) and 'Diffusion Models for Robot Manipulation' represent alternative paradigms that integrate language grounding or iterative denoising frameworks. Cosmos Policy's approach diverges from VLA methods by avoiding language tokenization and from diffusion-based trajectory planners by embedding actions directly in video latents.

Among the twenty-five candidates examined across the three contributions, none clearly refutes the proposed approach. For the first contribution (single-stage fine-tuning), ten candidates were examined with zero refutable matches; for the second (latent frame injection), no overlapping prior work was found among ten candidates; for the third (unified joint training), five candidates were reviewed without refutation. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of single-stage adaptation, latent frame encoding, and joint policy, world-model, and value training appears less explored. However, the search scale (twenty-five papers) is modest relative to the fifty-paper taxonomy.

Given the restricted literature search and the sparse population of the immediate taxonomy leaf (three papers total), the analysis indicates potential novelty but cannot claim exhaustive coverage. The sibling papers in 'Direct Video-to-Action Policy Learning' likely represent the closest prior work, yet detailed comparison statistics are unavailable. The broader video generation branch (thirteen papers across four leaves) and neighboring VLA methods (eleven papers) provide context but do not directly overlap with the latent-action encoding scheme described.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: fine-tuning video models for robot visuomotor control and planning. The field has evolved into a rich landscape organized around how visual and temporal information is leveraged for robotic decision-making. At the highest level, one major branch focuses on video generation models that learn policies directly from predicted or generated visual sequences, treating video prediction as a planning substrate. Another prominent direction explores vision-language-action (VLA) model adaptation, where large pre-trained multimodal models are fine-tuned to map language instructions and visual observations to robot actions. Pre-training and transfer learning methods form a third pillar, emphasizing how representations learned from diverse data sources can be adapted to downstream manipulation tasks. Diffusion models have emerged as a distinct branch, applying iterative denoising frameworks to generate action trajectories or visual plans. Language model-based control, visual representation learning, and imitation learning from demonstrations each carve out their own methodological niches, while specialized applications target domains like surgery or navigation. Supporting all of these are benchmarks and datasets that provide standardized evaluation, and techniques for aligning generative models to human preferences or task-specific objectives.

Within the video generation branch, a particularly active line of work explores direct video-to-action policy learning, where models predict future visual states and extract control signals without explicit action labels during pre-training. Cosmos Policy[0] exemplifies this approach by fine-tuning a video generation model to produce both visual rollouts and corresponding actions, closely related to methods like Universal Text-Guided[5] and Dreamitate[50] that similarly leverage video prediction for planning.
In contrast, works such as VLM Predictive Control[4] and Geometry-aware 4D[3] emphasize integrating geometric reasoning or closed-loop feedback into video-based planning, highlighting a trade-off between open-loop generation and reactive control. Across the taxonomy, a recurring theme is the tension between leveraging large-scale pre-trained models—whether video generators, VLAs like Fine-tuning VLA[1], or diffusion-based planners surveyed in Diffusion Robotic Survey[2]—and the need for efficient adaptation to embodied settings with limited robot data. Cosmos Policy[0] sits squarely in the video generation lineage, sharing the philosophy of Universal Text-Guided[5] that visual prediction can serve as a powerful prior for control, while differing from VLA approaches that rely more heavily on language grounding and action tokenization.

Claimed Contributions

Cosmos Policy: single-stage fine-tuning approach for video-based robot policies

The authors propose Cosmos Policy, which adapts a pretrained video generation model into a robot policy via single-stage fine-tuning without architectural changes. This contrasts with prior works that require multiple training stages and new architectural components for action generation.

10 retrieved papers
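To make concrete what "single-stage fine-tuning without architectural changes" amounts to, here is a minimal numpy sketch of one denoising training step under a standard noise-prediction objective. The toy model and the simple noise schedule are illustrative stand-ins, not Cosmos-Predict2's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoising_step_loss(model, latents, t):
    """One training step of the ordinary video-diffusion objective.

    `latents` is the latent sequence the video backbone already consumes;
    in Cosmos Policy it additionally contains frames encoding actions and
    values, so no new loss term or output head is introduced.
    """
    noise = rng.standard_normal(latents.shape)
    noisy = np.sqrt(1.0 - t) * latents + np.sqrt(t) * noise  # toy variance-preserving mix
    pred = model(noisy, t)                                    # model predicts the added noise
    return float(np.mean((pred - noise) ** 2))

# Toy "model" that always predicts zero noise; its loss is just E[noise^2] ~= 1.
toy_model = lambda x, t: np.zeros_like(x)
latents = rng.standard_normal((8, 16))  # 8 latent frames, 16 dims each
loss = denoising_step_loss(toy_model, latents, t=0.5)
```

Because the objective is unchanged, fine-tuning here is literally continuing this same loss on demonstration-derived latent sequences.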
Latent frame injection mechanism for incorporating multiple modalities

The authors introduce latent frame injection, a mechanism that encodes robot actions, proprioception, state values, and multiple camera views as latent frames within the video model's native diffusion process. This enables the model to handle new modalities without architectural modifications.

10 retrieved papers
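The latent-frame-injection idea can be pictured as plain sequence packing: each extra modality is flattened, padded to the width of a video latent frame, and appended so the unmodified backbone denoises everything jointly. The function name and zero-pad layout below are hypothetical illustrations, not the paper's actual encoding.

```python
import numpy as np

def inject_latent_frames(video_latents, action_chunk, state_value, latent_dim):
    """Append non-visual modalities to a video latent sequence as extra frames."""
    def to_frame(x):
        flat = np.asarray(x, dtype=float).ravel()
        assert flat.size <= latent_dim, "modality must fit in one latent frame"
        frame = np.zeros(latent_dim)
        frame[: flat.size] = flat  # zero-pad to frame width
        return frame

    extra = np.stack([to_frame(action_chunk), to_frame([state_value])])
    return np.concatenate([video_latents, extra], axis=0)

video = np.zeros((4, 32))  # 4 video latent frames of width 32
seq = inject_latent_frames(video, action_chunk=np.ones((2, 7)), state_value=0.9, latent_dim=32)
# seq has 6 frames: 4 video + 1 action frame + 1 value frame
```

Because the packed sequence has the same shape conventions as ordinary video latents, the diffusion model can attend across modalities with no architectural modification.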
Unified joint training of policy, world model, and value function

The authors develop a unified training approach where a single model simultaneously learns to predict actions, future states, and state values through the video diffusion objective. This enables test-time planning via best-of-N sampling using predicted future states and values.

5 retrieved papers
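The best-of-N planning loop described above reduces to a few lines: sample N candidate trajectories, each carrying actions and a predicted value, and keep the highest-valued one. `sample_fn` below is a toy stand-in for a full diffusion sampling call, and the hand-written value is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def best_of_n_plan(sample_fn, n=8):
    """Draw n candidate plans and return the one with the highest predicted value."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=lambda c: c["value"])

# Toy sampler: random action chunk; this "value" simply prefers small actions.
def toy_sample():
    actions = rng.standard_normal(7)
    return {"actions": actions, "value": -float(np.linalg.norm(actions))}

best = best_of_n_plan(toy_sample, n=16)
```

In the actual method the value would come from the model's own predicted value frame (alongside predicted future states) rather than a hand-written score.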

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Cosmos Policy: single-stage fine-tuning approach for video-based robot policies

The authors propose Cosmos Policy, which adapts a pretrained video generation model into a robot policy via single-stage fine-tuning without architectural changes. This contrasts with prior works that require multiple training stages and new architectural components for action generation.

Contribution

Latent frame injection mechanism for incorporating multiple modalities

The authors introduce latent frame injection, a mechanism that encodes robot actions, proprioception, state values, and multiple camera views as latent frames within the video model's native diffusion process. This enables the model to handle new modalities without architectural modifications.

Contribution

Unified joint training of policy, world model, and value function

The authors develop a unified training approach where a single model simultaneously learns to predict actions, future states, and state values through the video diffusion objective. This enables test-time planning via best-of-N sampling using predicted future states and values.
