Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation

ICLR 2026 Conference Submission · Anonymous Authors
Vectorized Timesteps · Flow Matching · Temporal Modeling · Video Generation
Abstract:

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa V1.0, a versatile model that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Note that VTA is a non-destructive adaptation, which means that it fully preserves the capabilities of the base model. Unlike conventional methods like Wan-I2V, which finetune a base text-to-video (T2V) model with abundant resources to do image-to-video (I2V), we achieve comparable results in a zero-shot manner after an ultra-efficient finetuning process based on VTA. Moreover, this method also unlocks many other zero-shot capabilities simultaneously, such as start-end frames and video extension, all without task-specific training. Meanwhile, it keeps the T2V capability of the base model. Mechanistic analyses also reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Vectorized Timestep Adaptation (VTA) to enable fine-grained temporal control in video diffusion models, positioning itself within the 'Vectorized and Frame-Level Timestep Control' leaf of the taxonomy. This leaf contains only two papers total, including the original work, indicating a relatively sparse research direction. The core contribution centers on using independent per-frame noise schedules rather than scalar timesteps, allowing the model to achieve zero-shot image-to-video generation and other tasks without task-specific training. This approach contrasts with the broader field's tendency toward learned motion priors or global conditioning strategies.
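To make the mechanism concrete, the sketch below contrasts a scalar timestep with a vectorized (per-frame) one under a rectified-flow-style forward process, and shows how pinning the first frame's timestep to zero amounts to conditioning on a given image. This is an illustrative reading of the idea, not the authors' implementation; the latent shapes and variable names are assumptions.

```python
# Illustrative sketch (not the authors' implementation): a scalar timestep
# noises every frame by the same amount, while a vectorized timestep lets
# each frame sit at its own noise level. Shapes and names are assumptions.
import torch

n_frames, C, H, W = 9, 4, 32, 32
x0 = torch.randn(n_frames, C, H, W)       # clean per-frame video latents
noise = torch.randn_like(x0)

# Conventional scalar timestep: all frames evolve in lockstep.
t_scalar = 0.7
x_t_sync = (1 - t_scalar) * x0 + t_scalar * noise

# Vectorized timestep: one noise level per frame.
t_vec = torch.rand(n_frames)
t_b = t_vec.view(n_frames, 1, 1, 1)
x_t_async = (1 - t_b) * x0 + t_b * noise

# Zero-shot image-to-video falls out of the same interface: keep the
# conditioning frame clean by pinning its timestep to zero.
t_vec_i2v = t_vec.clone()
t_vec_i2v[0] = 0.0                        # frame 0 remains the given image
```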

The taxonomy reveals that temporal control research has diversified across multiple branches. Neighboring leaves include 'Temporal Attention and Recurrence Mechanisms' (three papers) and 'Space-Time Joint Generation Architectures' (two papers), which address coherence through architectural components rather than timestep manipulation. The 'Conditional Control Mechanisms' branch explores spatial guidance through edges, trajectories, and 3D priors, while 'Motion and Appearance Customization' focuses on learning reusable motion concepts. Pusa's vectorized timestep approach represents a distinct paradigm: embedding temporal control directly into the diffusion schedule rather than through learned representations or external conditioning signals.

Among the twenty-eight candidates examined, the contribution-level analysis reveals mixed novelty signals. For the VTA mechanism itself (Contribution 1), ten candidates were examined, one of which appears to provide overlapping prior work, suggesting some precedent for vectorized timestep concepts within the limited search scope. For the unified multi-task framework (Contribution 2), nine candidates were examined and one constitutes a refutable match, indicating that zero-shot generalization approaches exist in related contexts. For the Frame-Aware Flow Matching objective (Contribution 3), nine candidates were examined with no clear refutations, suggesting this training formulation may be more distinctive within the sampled literature.

Based on the limited search scope of twenty-eight semantically similar papers, the work appears to occupy a sparsely populated research direction with some conceptual overlap in vectorized timestep ideas but potentially novel integration and training strategies. The analysis does not cover exhaustive prior work in video diffusion or temporal control more broadly, and the taxonomy structure suggests active parallel development in complementary approaches to temporal modeling.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 2

Research Landscape Overview

Core task: fine-grained temporal control in video diffusion models. The field has evolved around several complementary directions that together address how to generate coherent, controllable video sequences. Foundational Video Diffusion Models such as Lumiere[7] and Imagen Video[50] establish baseline architectures for temporal generation, while Temporal Modeling Architectures and Mechanisms explore how to represent and manipulate time at different granularities, ranging from frame-level timestep control (e.g., Vectorized Timestep[19]) to autoregressive and sequential generation strategies that build videos incrementally. Conditional Control Mechanisms and Interfaces introduce diverse input modalities, including text, sketches, and spatial signals, enabling users to steer content more precisely. Motion and Appearance Customization branches focus on disentangling and personalizing dynamic attributes, with works like MotionDirector[15] and MotionFlow[23] targeting motion-specific tuning. Meanwhile, Temporal Consistency and Coherence Enhancement addresses the challenge of maintaining stable object identity and smooth transitions across frames, and Training and Optimization Strategies investigate efficient learning paradigms, including methods like DenseDPO[3] that refine models via preference-based feedback.

A particularly active line of work centers on achieving fine-grained control over when and how motion unfolds. Pusa[0] sits within the Vectorized and Frame-Level Timestep Control cluster, emphasizing per-frame manipulation of diffusion timesteps to modulate temporal dynamics precisely. This approach contrasts with methods that rely on global conditioning or coarse temporal segmentation, such as those in Conditional Control Mechanisms that apply uniform guidance across the entire sequence. Nearby works like Vectorized Timestep[19] share a similar philosophy of frame-wise parameterization, while others in Motion and Appearance Customization (e.g., MotionDirector[15]) focus more on learning reusable motion priors rather than explicit timestep modulation. The interplay between these branches highlights an ongoing tension: whether to embed temporal control directly into the diffusion schedule or to encode it through learned representations and conditioning signals. Pusa[0] exemplifies the former strategy, offering a complementary perspective to appearance-driven and motion-prior methods.

Claimed Contributions

Vectorized Timestep Adaptation (VTA) for efficient video diffusion model adaptation

The authors propose a non-destructive adaptation method called Vectorized Timestep Adaptation (VTA) that inflates the scalar timestep variable of pretrained video diffusion models into a frame-level vector. This enables fine-grained temporal control while fully preserving the base model's capabilities, achieving state-of-the-art image-to-video performance with minimal training data and computational cost. An illustrative code sketch of this adaptation follows this entry.

10 retrieved papers
Can Refute
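As a rough picture of what "inflating the scalar timestep into a frame-level vector" might look like inside a DiT-style denoiser, the minimal sketch below evaluates the usual timestep-embedding MLP once per frame so that every frame receives its own modulation signal. The module, dimensions, and names are hypothetical and are not taken from Pusa's released code.

```python
# Hypothetical sketch of frame-level timestep conditioning; not Pusa's code.
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal timestep embedding; t has shape (num_frames,)."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32) * math.log(10000.0) / half)
    args = t.float().unsqueeze(-1) * freqs
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class FrameLevelTimestepConditioner(nn.Module):
    """Reuses the kind of timestep MLP a scalar-timestep model applies once
    globally, but evaluates it per frame so every frame gets its own
    modulation signal."""
    def __init__(self, dim: int = 256, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, hidden))

    def forward(self, t_vec: torch.Tensor) -> torch.Tensor:
        # t_vec: (num_frames,) -> (num_frames, hidden)
        return self.mlp(sinusoidal_embedding(t_vec))

cond = FrameLevelTimestepConditioner()
per_frame_signal = cond(torch.rand(9))    # one conditioning row per frame
```

Because the pretrained conditioning pathway is reused rather than replaced, setting every entry of the timestep vector to the same value recovers scalar-timestep behavior, which is one way to read the claim that VTA is non-destructive.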
Unified multi-task video generation framework with zero-shot generalization

The authors develop a unified framework that simultaneously supports multiple video generation tasks including text-to-video, image-to-video, start-end frame conditioning, and video extension without requiring task-specific retraining. This zero-shot multi-task capability emerges from the flexible vectorized timestep control mechanism. A sketch of this sampling-time control follows this entry.

9 retrieved papers
Can Refute
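Under the vectorized-timestep reading, the zero-shot multi-task behavior can be pictured as different choices of which frames are pinned at timestep zero during sampling, with one and the same model. The sketch below is a hedged reconstruction of that idea; the task names, the choice of four pinned frames for extension, the Euler-style loop, and the model(latents, t_vec) interface are all assumptions rather than the authors' pipeline.

```python
# Hedged sketch: several tasks reduce to choices of which frames are pinned at
# timestep 0 during sampling. Function names and the Euler loop are assumptions.
import torch

def conditioning_mask(task: str, num_frames: int) -> torch.Tensor:
    """Boolean mask of frames that are given (kept at timestep 0)."""
    mask = torch.zeros(num_frames, dtype=torch.bool)
    if task == "i2v":
        mask[0] = True                    # first frame is the given image
    elif task == "start_end":
        mask[0] = True                    # first and last frames are given
        mask[-1] = True
    elif task == "extension":
        mask[:4] = True                   # leading frames come from an existing clip
    return mask                           # "t2v": nothing is pinned

def sample(model, latents, clean, task, steps=10):
    """latents, clean: (num_frames, C, H, W); model(x, t_vec) predicts velocity."""
    num_frames = latents.shape[0]
    pinned = conditioning_mask(task, num_frames)
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t_vec = torch.full((num_frames,), ts[i].item())
        t_vec[pinned] = 0.0               # conditioning frames sit at timestep 0...
        latents[pinned] = clean[pinned]   # ...and stay equal to their clean latents
        v = model(latents, t_vec)         # per-frame velocity prediction
        latents = latents + (ts[i + 1] - ts[i]) * v
    latents[pinned] = clean[pinned]
    return latents
```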
Frame-Aware Flow Matching objective for vectorized timestep training

The authors extend the Frame-Aware Video Diffusion Model paradigm to the flow matching framework by introducing a Frame-Aware Flow Matching (FAFM) objective. This formulation enables each video frame to evolve independently along its own probability path with frame-specific timesteps, avoiding the rigid synchronization of conventional video diffusion models. A sketch of this objective follows this entry.

9 retrieved papers
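One plausible reading of the FAFM objective, under the standard rectified-flow parameterization, is an ordinary flow-matching loss in which each frame draws its own timestep. The sketch below follows that reading; it is not the paper's exact formulation, and the latent shapes and model(x_t, t) interface are assumptions.

```python
# Plausible sketch of a frame-aware flow-matching loss; not the paper's exact code.
import torch
import torch.nn.functional as F

def fafm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """x0: clean video latents of shape (num_frames, C, H, W)."""
    num_frames = x0.shape[0]
    t = torch.rand(num_frames)                 # independent timestep per frame
    noise = torch.randn_like(x0)
    t_b = t.view(num_frames, 1, 1, 1)
    x_t = (1 - t_b) * x0 + t_b * noise         # each frame on its own probability path
    v_target = noise - x0                      # flow-matching velocity target
    v_pred = model(x_t, t)                     # model conditioned on the timestep vector
    return F.mse_loss(v_pred, v_target)
```

If every entry of t is tied to a single scalar, this sketch collapses to the base model's standard flow-matching objective, consistent with the report's description of VTA as a non-destructive adaptation.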

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Vectorized Timestep Adaptation (VTA) for efficient video diffusion model adaptation


Contribution

Unified multi-task video generation framework with zero-shot generalization


Contribution

Frame-Aware Flow Matching objective for vectorized timestep training
