Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Overview
Overall Novelty Assessment
Cosmos Policy proposes a single-stage fine-tuning approach that adapts a large pretrained video generation model (Cosmos-Predict2) to generate robot actions as latent frames within a diffusion process. It resides in the 'Direct Video-to-Action Policy Learning' leaf, which contains only three papers including the original. This leaf sits within the broader 'Video Generation Models for Robot Policy Learning' branch, indicating a relatively focused but not overcrowded research direction. The taxonomy shows four sibling leaves under video generation (direct action, trajectory planning, world models, evaluation), suggesting this is an active area with multiple methodological variants.
The taxonomy reveals that direct video-to-action methods neighbor 'Video Generation for Trajectory Planning and Simulation' (four papers) and 'Video-Based World Models for Control' (four papers), which emphasize explicit trajectory synthesis or dynamics modeling rather than end-to-end action generation. Parallel branches like 'Vision-Language-Action Model Adaptation' (eleven papers across five leaves) and 'Diffusion Models for Robot Manipulation' represent alternative paradigms that integrate language grounding or iterative denoising frameworks. Cosmos Policy's approach diverges from VLA methods by avoiding language tokenization and from diffusion-based trajectory planners by embedding actions directly in video latents.
Among the twenty-five candidates examined across the three contributions, none clearly refutes the proposed approach. The first contribution (single-stage fine-tuning) was checked against ten candidates with no refuting match; the second (latent frame injection) likewise surfaced no overlapping prior work among its ten candidates; the third (unified joint training) was reviewed against five candidates without refutation. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of single-stage adaptation, latent frame encoding, and joint policy, world-model, and value training is comparatively unexplored. However, the search scale (twenty-five papers) is modest relative to the fifty-paper taxonomy.
Given the restricted literature search and the sparse population of the immediate taxonomy leaf (three papers total), the analysis indicates potential novelty but cannot claim exhaustive coverage. The sibling papers in 'Direct Video-to-Action Policy Learning' likely represent the closest prior work, yet detailed comparison statistics are unavailable. The broader video generation branch (thirteen papers across four leaves) and neighboring VLA methods (eleven papers) provide context but do not directly overlap with the latent-action encoding scheme described.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose Cosmos Policy, which adapts a pretrained video generation model into a robot policy via single-stage fine-tuning without architectural changes. This contrasts with prior works that require multiple training stages and new architectural components for action generation.
The authors introduce latent frame injection, a mechanism that encodes robot actions, proprioception, state values, and multiple camera views as latent frames within the video model's native diffusion process. This enables the model to handle new modalities without architectural modifications.
The authors develop a unified training approach where a single model simultaneously learns to predict actions, future states, and state values through the video diffusion objective. This enables test-time planning via best-of-N sampling using predicted future states and values.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] Learning Universal Policies via Text-Guided Video Generation
[50] Dreamitate: Real-World Visuomotor Policy Learning via Video Generation
Contribution Analysis
Detailed comparisons for each claimed contribution
Cosmos Policy: single-stage fine-tuning approach for video-based robot policies
The authors propose Cosmos Policy, which adapts a pretrained video generation model into a robot policy via single-stage fine-tuning without architectural changes. This contrasts with prior works that require multiple training stages and new architectural components for action generation.
[1] Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
[40] Robotic-CLIP: Fine-Tuning CLIP on Action Data for Robotic Applications
[51] Octo: An Open-Source Generalist Robot Policy
[52] Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
[53] GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
[54] Hand-Object Interaction Pretraining from Videos
[55] Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation
[56] NaVILA: Legged Robot Vision-Language-Action Model for Navigation
[57] Towards a Generalizable Bimanual Foundation Policy via Flow-Based Video Prediction
[58] Vidar: Embodied Video Diffusion Model for Generalist Manipulation
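The "single-stage" claim can be made concrete with a minimal sketch: the same denoising objective that pretrained the video model is applied, unchanged, to the full latent sequence once action frames are appended. The function names and the simplified noise schedule below are illustrative assumptions, not Cosmos Policy's actual training code.

```python
import numpy as np

def single_stage_loss(denoiser, latents, rng):
    """One step of a standard denoising objective applied to the full
    latent sequence (video frames plus any injected action frames).
    No auxiliary heads or extra loss terms: operationally, this is
    what single-stage fine-tuning without architectural changes means.
    The linear noise schedule here is a simplification."""
    t = rng.uniform(0.01, 0.99)              # sampled noise level
    noise = rng.normal(size=latents.shape)
    noisy = np.sqrt(1.0 - t) * latents + np.sqrt(t) * noise
    pred = denoiser(noisy, t)                # model predicts the added noise
    return float(np.mean((pred - noise) ** 2))
```

Because actions appear as additional latent frames, fine-tuning can reuse the pretrained model's weights, optimizer, and loss essentially verbatim.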
Latent frame injection mechanism for incorporating multiple modalities
The authors introduce latent frame injection, a mechanism that encodes robot actions, proprioception, state values, and multiple camera views as latent frames within the video model's native diffusion process. This enables the model to handle new modalities without architectural modifications.
[59] Vid-GPT: Introducing GPT-Style Autoregressive Generation in Video Diffusion Models
[60] LDMVFI: Video Frame Interpolation with Latent Diffusion Models
[61] MagicVideo: Efficient Video Generation with Latent Diffusion Models
[62] Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
[63] Latent-Reframe: Enabling Camera Control for Video Diffusion Models Without Training
[64] Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition
[65] VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
[66] SignGen: End-to-End Sign Language Video Generation with Latent Diffusion
[67] VGDFR: Diffusion-Based Video Generation with Dynamic Latent Frame Rate
[68] Latent Flow Diffusion for Deepfake Video Generation
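The injection mechanism can be sketched as a pure packing step: each non-visual modality is zero-padded into the shape of one latent frame and concatenated along the temporal axis, so the diffusion backbone processes a slightly longer "video". The shapes, the flattening layout, and the names `inject_latent_frames`/`to_frame` are illustrative assumptions, not the paper's actual encoding.

```python
import numpy as np

def inject_latent_frames(video_latents, action_chunk, proprio, value, latent_shape):
    """Pack actions, proprioception, and a state value as extra latent
    frames appended to the video latents (T, C, H, W), so a pretrained
    diffusion backbone can denoise them with no architectural change.
    The zero-padded flattening used here is a toy layout."""
    C, H, W = latent_shape

    def to_frame(vec):
        vec = np.asarray(vec, dtype=np.float32).ravel()
        frame = np.zeros(C * H * W, dtype=np.float32)
        frame[: vec.size] = vec              # pad the modality into one frame
        return frame.reshape(C, H, W)

    extra = np.stack([to_frame(action_chunk), to_frame(proprio), to_frame(value)])
    return np.concatenate([video_latents, extra], axis=0)
```

At inference the model would denoise these appended frames jointly with the predicted future-observation frames, and the action chunk would be read back out of its slot.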
Unified joint training of policy, world model, and value function
The authors develop a unified training approach where a single model simultaneously learns to predict actions, future states, and state values through the video diffusion objective. This enables test-time planning via best-of-N sampling using predicted future states and values.
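The test-time planning step described above reduces to best-of-N selection over sampled rollouts, scored by the model's own value prediction. The sketch below fixes only that interface; `toy_sampler` and its toy value function are assumptions standing in for the fine-tuned diffusion model.

```python
import numpy as np

def best_of_n_plan(observation, sampler, n=8):
    """Sample n candidate rollouts and keep the one whose predicted
    state value is highest (test-time best-of-N planning)."""
    candidates = [sampler(observation) for _ in range(n)]
    return max(candidates, key=lambda c: c[2])

rng = np.random.default_rng(0)  # deterministic toy randomness

def toy_sampler(observation):
    """Stand-in for one diffusion sample: returns (action chunk,
    predicted future state, predicted scalar value). A real policy
    would denoise action, future-state, and value latent frames jointly."""
    action = rng.normal(size=7)
    future_state = observation + action.mean()
    value = float(-abs(future_state))        # toy value: prefer states near zero
    return action, future_state, value

action, state, value = best_of_n_plan(0.0, toy_sampler, n=8)
```

With an actual checkpoint, the sampler would be the fine-tuned video model itself, so planning requires no separate world model or critic network.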