Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation

ICLR 2026 Conference Submission · Anonymous Authors
Vectorized Timesteps · Flow Matching · Temporal Modeling · Video Generation
Abstract:

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa V1.0, a versatile model that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Note that VTA is a non-destructive adaptation, which means that it fully preserves the capabilities of the base model. Unlike conventional methods like Wan-I2V, which finetune a base text-to-video (T2V) model with abundant resources to do image-to-video (I2V), we achieve comparable results in a zero-shot manner after an ultra-efficient finetuning process based on VTA. Moreover, this method also unlocks many other zero-shot capabilities simultaneously, such as start-end frames and video extension, all without task-specific training. Meanwhile, it keeps the T2V capability of the base model. Mechanistic analyses also reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Vectorized Timestep Adaptation (VTA) to enable fine-grained temporal control in video diffusion models, positioning itself within the 'Vectorized and Frame-Level Timestep Control' leaf of the taxonomy. This leaf contains only two papers total, including the original work, indicating a relatively sparse research direction. The core contribution centers on using independent per-frame noise schedules rather than scalar timesteps, allowing the model to achieve zero-shot image-to-video generation and other tasks without task-specific training. This approach contrasts with the broader field's tendency toward learned motion priors or global conditioning strategies.
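To make the mechanism concrete, the sketch below contrasts a scalar timestep with a vectorized (per-frame) one under a rectified-flow-style forward process, and shows how pinning the first frame's timestep to zero amounts to conditioning on a given image. This is an illustrative reading of the idea, not the authors' implementation; the latent shapes and variable names are assumptions.

```python
# Illustrative sketch (not the authors' implementation): a scalar timestep
# noises every frame by the same amount, while a vectorized timestep lets
# each frame sit at its own noise level. Shapes and names are assumptions.
import torch

n_frames, C, H, W = 9, 4, 32, 32
x0 = torch.randn(n_frames, C, H, W)       # clean per-frame video latents
noise = torch.randn_like(x0)

# Conventional scalar timestep: all frames evolve in lockstep.
t_scalar = 0.7
x_t_sync = (1 - t_scalar) * x0 + t_scalar * noise

# Vectorized timestep: one noise level per frame.
t_vec = torch.rand(n_frames)
t_b = t_vec.view(n_frames, 1, 1, 1)
x_t_async = (1 - t_b) * x0 + t_b * noise

# Zero-shot image-to-video falls out of the same interface: keep the
# conditioning frame clean by pinning its timestep to zero.
t_vec_i2v = t_vec.clone()
t_vec_i2v[0] = 0.0                        # frame 0 remains the given image
```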

The taxonomy reveals that temporal control research has diversified across multiple branches. Neighboring leaves include 'Temporal Attention and Recurrence Mechanisms' (three papers) and 'Space-Time Joint Generation Architectures' (two papers), which address coherence through architectural components rather than timestep manipulation. The 'Conditional Control Mechanisms' branch explores spatial guidance through edges, trajectories, and 3D priors, while 'Motion and Appearance Customization' focuses on learning reusable motion concepts. Pusa's vectorized timestep approach represents a distinct paradigm: embedding temporal control directly into the diffusion schedule rather than through learned representations or external conditioning signals.

Among the twenty-eight candidates examined, the contribution-level analysis reveals mixed novelty signals. For the VTA mechanism itself (Contribution 1), ten candidates were examined, one of which appears to provide overlapping prior work, suggesting some precedent for vectorized timestep concepts within the limited search scope. For the unified multi-task framework (Contribution 2), nine candidates were examined and one constitutes a refutable match, indicating that zero-shot generalization approaches exist in related contexts. For the Frame-Aware Flow Matching objective (Contribution 3), nine candidates were examined with no clear refutations, suggesting this training formulation may be more distinctive within the sampled literature.

Based on the limited search scope of twenty-eight semantically similar papers, the work appears to occupy a sparsely populated research direction with some conceptual overlap in vectorized timestep ideas but potentially novel integration and training strategies. The analysis does not cover exhaustive prior work in video diffusion or temporal control more broadly, and the taxonomy structure suggests active parallel development in complementary approaches to temporal modeling.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 2

Research Landscape Overview

Core task: fine-grained temporal control in video diffusion models. The field has evolved around several complementary directions that together address how to generate coherent, controllable video sequences. Foundational Video Diffusion Models such as Lumiere[7] and Imagen Video[50] establish baseline architectures for temporal generation, while Temporal Modeling Architectures and Mechanisms explore how to represent and manipulate time at different granularities, ranging from frame-level timestep control (e.g., Vectorized Timestep[19]) to autoregressive and sequential generation strategies that build videos incrementally. Conditional Control Mechanisms and Interfaces introduce diverse input modalities, including text, sketches, and spatial signals, enabling users to steer content more precisely. Motion and Appearance Customization branches focus on disentangling and personalizing dynamic attributes, with works like MotionDirector[15] and MotionFlow[23] targeting motion-specific tuning. Meanwhile, Temporal Consistency and Coherence Enhancement addresses the challenge of maintaining stable object identity and smooth transitions across frames, and Training and Optimization Strategies investigate efficient learning paradigms, including methods like DenseDPO[3] that refine models via preference-based feedback.

A particularly active line of work centers on achieving fine-grained control over when and how motion unfolds. Pusa[0] sits within the Vectorized and Frame-Level Timestep Control cluster, emphasizing per-frame manipulation of diffusion timesteps to modulate temporal dynamics precisely. This approach contrasts with methods that rely on global conditioning or coarse temporal segmentation, such as those in Conditional Control Mechanisms that apply uniform guidance across the entire sequence. Nearby works like Vectorized Timestep[19] share a similar philosophy of frame-wise parameterization, while others in Motion and Appearance Customization (e.g., MotionDirector[15]) focus more on learning reusable motion priors rather than explicit timestep modulation. The interplay between these branches highlights an ongoing tension: whether to embed temporal control directly into the diffusion schedule or to encode it through learned representations and conditioning signals. Pusa[0] exemplifies the former strategy, offering a complementary perspective to appearance-driven and motion-prior methods.

Claimed Contributions

Vectorized Timestep Adaptation (VTA) for efficient video diffusion model adaptation

The authors propose a non-destructive adaptation method called Vectorized Timestep Adaptation (VTA) that inflates the scalar timestep variable of pretrained video diffusion models into a frame-level vector. This enables fine-grained temporal control while fully preserving the base model's capabilities, achieving state-of-the-art image-to-video performance with minimal training data and computational cost. An illustrative code sketch of this adaptation follows this entry.

10 retrieved papers
Can Refute
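As a rough picture of what "inflating the scalar timestep into a frame-level vector" might look like inside a DiT-style denoiser, the minimal sketch below evaluates the usual timestep-embedding MLP once per frame so that every frame receives its own modulation signal. The module, dimensions, and names are hypothetical and are not taken from Pusa's released code.

```python
# Hypothetical sketch of frame-level timestep conditioning; not Pusa's code.
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal timestep embedding; t has shape (num_frames,)."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32) * math.log(10000.0) / half)
    args = t.float().unsqueeze(-1) * freqs
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class FrameLevelTimestepConditioner(nn.Module):
    """Reuses the kind of timestep MLP a scalar-timestep model applies once
    globally, but evaluates it per frame so every frame gets its own
    modulation signal."""
    def __init__(self, dim: int = 256, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, hidden))

    def forward(self, t_vec: torch.Tensor) -> torch.Tensor:
        # t_vec: (num_frames,) -> (num_frames, hidden)
        return self.mlp(sinusoidal_embedding(t_vec))

cond = FrameLevelTimestepConditioner()
per_frame_signal = cond(torch.rand(9))    # one conditioning row per frame
```

Because the pretrained conditioning pathway is reused rather than replaced, setting every entry of the timestep vector to the same value recovers scalar-timestep behavior, which is one way to read the claim that VTA is non-destructive.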
Unified multi-task video generation framework with zero-shot generalization

The authors develop a unified framework that simultaneously supports multiple video generation tasks including text-to-video, image-to-video, start-end frame conditioning, and video extension without requiring task-specific retraining. This zero-shot multi-task capability emerges from the flexible vectorized timestep control mechanism. A sketch of this sampling-time control follows this entry.

9 retrieved papers
Can Refute
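Under the vectorized-timestep reading, the zero-shot multi-task behavior can be pictured as different choices of which frames are pinned at timestep zero during sampling, with one and the same model. The sketch below is a hedged reconstruction of that idea; the task names, the choice of four pinned frames for extension, the Euler-style loop, and the model(latents, t_vec) interface are all assumptions rather than the authors' pipeline.

```python
# Hedged sketch: several tasks reduce to choices of which frames are pinned at
# timestep 0 during sampling. Function names and the Euler loop are assumptions.
import torch

def conditioning_mask(task: str, num_frames: int) -> torch.Tensor:
    """Boolean mask of frames that are given (kept at timestep 0)."""
    mask = torch.zeros(num_frames, dtype=torch.bool)
    if task == "i2v":
        mask[0] = True                    # first frame is the given image
    elif task == "start_end":
        mask[0] = True                    # first and last frames are given
        mask[-1] = True
    elif task == "extension":
        mask[:4] = True                   # leading frames come from an existing clip
    return mask                           # "t2v": nothing is pinned

def sample(model, latents, clean, task, steps=10):
    """latents, clean: (num_frames, C, H, W); model(x, t_vec) predicts velocity."""
    num_frames = latents.shape[0]
    pinned = conditioning_mask(task, num_frames)
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t_vec = torch.full((num_frames,), ts[i].item())
        t_vec[pinned] = 0.0               # conditioning frames sit at timestep 0...
        latents[pinned] = clean[pinned]   # ...and stay equal to their clean latents
        v = model(latents, t_vec)         # per-frame velocity prediction
        latents = latents + (ts[i + 1] - ts[i]) * v
    latents[pinned] = clean[pinned]
    return latents
```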
Frame-Aware Flow Matching objective for vectorized timestep training

The authors extend the Frame-Aware Video Diffusion Model paradigm to the flow matching framework by introducing a Frame-Aware Flow Matching (FAFM) objective. This formulation enables each video frame to evolve independently along its own probability path with frame-specific timesteps, avoiding the rigid synchronization of conventional video diffusion models. A sketch of this objective follows this entry.

9 retrieved papers
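One plausible reading of the FAFM objective, under the standard rectified-flow parameterization, is an ordinary flow-matching loss in which each frame draws its own timestep. The sketch below follows that reading; it is not the paper's exact formulation, and the latent shapes and model(x_t, t) interface are assumptions.

```python
# Plausible sketch of a frame-aware flow-matching loss; not the paper's exact code.
import torch
import torch.nn.functional as F

def fafm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """x0: clean video latents of shape (num_frames, C, H, W)."""
    num_frames = x0.shape[0]
    t = torch.rand(num_frames)                 # independent timestep per frame
    noise = torch.randn_like(x0)
    t_b = t.view(num_frames, 1, 1, 1)
    x_t = (1 - t_b) * x0 + t_b * noise         # each frame on its own probability path
    v_target = noise - x0                      # flow-matching velocity target
    v_pred = model(x_t, t)                     # model conditioned on the timestep vector
    return F.mse_loss(v_pred, v_target)
```

If every entry of t is tied to a single scalar, this sketch collapses to the base model's standard flow-matching objective, consistent with the report's description of VTA as a non-destructive adaptation.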

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Vectorized Timestep Adaptation (VTA) for efficient video diffusion model adaptation


Contribution

Unified multi-task video generation framework with zero-shot generalization


Contribution

Frame-Aware Flow Matching objective for vectorized timestep training
