Towards One-step Causal Video Generation via Adversarial Self-Distillation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Diffusion Distillation, Causal Text-to-Video Generation
Abstract:

Recent hybrid video generation models combine autoregressive temporal dynamics with diffusion-based spatial denoising, but their sequential, iterative nature leads to error accumulation and long inference times. In this work, we propose a distillation-based framework for efficient causal video generation that enables high-quality synthesis with extremely limited denoising steps. Our approach builds upon the Distribution Matching Distillation (DMD) framework and introduces a novel Adversarial Self-Distillation (ASD) strategy, which aligns the outputs of the student model's n-step denoising process with its (n+1)-step version at the distribution level. This design provides smoother supervision by bridging small intra-student gaps, and more informative guidance by combining teacher knowledge with locally consistent student behavior, substantially improving training stability and generation quality in extremely few-step scenarios. In addition, we present a First-Frame Enhancement (FFE) strategy, which allocates more denoising steps to the initial frames to mitigate error propagation while applying larger skipping steps to later frames. Extensive experiments on VBench demonstrate that our method surpasses state-of-the-art approaches in both one-step and two-step video generation. Notably, our framework produces a single distilled model that flexibly supports multiple inference-step settings, eliminating the need for repeated re-distillation and enabling efficient, high-quality video synthesis.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a distillation framework for efficient causal video generation, introducing Adversarial Self-Distillation (ASD) and First-Frame Enhancement (FFE) strategies to reduce denoising steps. It resides in the 'Bidirectional-to-Causal Architecture Adaptation' leaf, which contains only three papers total, including this work and two siblings. This leaf focuses on converting pretrained bidirectional diffusion transformers into causal autoregressive generators. The small sibling count suggests a relatively sparse research direction within the broader taxonomy of nine papers across seven leaf nodes.

The taxonomy reveals neighboring approaches in 'Acceleration via Distillation and Step Reduction' (two papers) and 'Parallel and Streaming Generation Strategies' (two papers). The paper's leaf excludes methods designing autoregressive models from scratch or using masked autoregression, which belong to the 'Masked Autoregressive Planning' leaf (one paper). The work bridges its bidirectional-to-causal focus with distillation techniques from the acceleration branch, while differing from streaming methods that prioritize real-time generation over step reduction. This positioning suggests the paper combines architectural adaptation with distillation-driven speedup, straddling two related but distinct research directions.

Among 25 candidates examined, the analysis found refutable prior work for all three contributions. The ASD strategy examined 10 candidates with 1 refutable match, suggesting some overlap in self-distillation concepts. FFE examined 5 candidates with 1 refutable match, indicating prior work on adaptive step allocation. The step-unified distillation framework examined 10 candidates with 4 refutable matches, the highest overlap count, suggesting this contribution has more substantial precedent. The limited search scope (25 papers, not exhaustive) means these statistics reflect top-K semantic matches rather than comprehensive field coverage.

Based on the limited search scope, the work appears to combine established distillation ideas with architectural adaptation in a moderately explored area. The taxonomy's sparse structure (nine total papers) and the contribution-level statistics (6 total refutable pairs across 25 candidates) suggest incremental refinement rather than foundational novelty. However, the analysis covers only top-K semantic matches and does not capture potential novelty in implementation details, experimental validation, or combinations of techniques.

Taxonomy

Core-task Taxonomy Papers: 9
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 6

Research Landscape Overview

Core task: efficient causal video generation with minimal denoising steps. The field addresses the challenge of producing temporally coherent video sequences in a causally consistent manner while reducing computational overhead. The taxonomy reveals four main branches: Autoregressive Temporal Modeling with Diffusion Synthesis, which combines sequential frame generation with diffusion-based synthesis; Acceleration via Distillation and Step Reduction, focusing on compressing multi-step diffusion into fewer iterations; Parallel and Streaming Generation Strategies, exploring methods that generate multiple frames or segments concurrently; and Neural ODE-Based Temporal Flow Modeling, leveraging continuous-time dynamics for smoother temporal transitions. Works like Fast Autoregressive Video[1] and Progressive Autoregressive Video[2] exemplify autoregressive approaches, while Mardini[3] and DiT Acceleration[9] illustrate distillation-driven speedups. These branches collectively aim to balance generation quality, temporal consistency, and inference speed.

A particularly active line of work centers on adapting bidirectional architectures for causal generation, where models originally trained on full-sequence contexts are repurposed for sequential, left-to-right synthesis. Causal Video Distillation[0] sits within this cluster, emphasizing the conversion of bidirectional diffusion models into efficient causal generators through distillation techniques. This contrasts with Fast Causal Generators[6], which may prioritize architectural redesign from scratch, and Fast Autoregressive Video[1], which focuses on autoregressive refinement without necessarily distilling from a bidirectional source.

The main trade-off across these approaches involves whether to leverage pre-trained bidirectional knowledge or to train causal models de novo, and how aggressively to reduce denoising steps without sacrificing frame-to-frame coherence. Open questions remain around optimal distillation schedules and the extent to which parallel strategies can be integrated into inherently sequential causal pipelines.

Claimed Contributions

Adversarial Self-Distillation (ASD) strategy for few-step video generation

The authors introduce a novel distillation method that uses a discriminator to align the n-step and (n+1)-step denoising distributions of a single student model. This approach provides smoother supervision by bridging smaller step-to-step gaps and enables a single model to support multiple inference-step configurations without separate re-distillation.
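The paper does not publish its training code, but the distribution-level alignment described above can be illustrated with a toy sketch. The snippet below assumes a standard non-saturating GAN objective over discriminator logits; the function names, the toy 1D logit setting, and the specific loss form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def asd_losses(d_n, d_n1):
    """Toy non-saturating GAN losses for Adversarial Self-Distillation.

    d_n  : discriminator logits on the student's n-step outputs
           (treated as the "fake" side of the game)
    d_n1 : discriminator logits on the same student's (n+1)-step
           outputs (treated as the "real" side)
    """
    eps = 1e-12
    # Discriminator: score (n+1)-step outputs high, n-step outputs low.
    loss_d = (-np.mean(np.log(sigmoid(d_n1) + eps))
              - np.mean(np.log(1.0 - sigmoid(d_n) + eps)))
    # Student (generator side): make its n-step outputs
    # indistinguishable from its own (n+1)-step outputs.
    loss_g = -np.mean(np.log(sigmoid(d_n) + eps))
    return loss_d, loss_g
```

In this framing the "real" distribution moves with the student itself, which is what makes the step-to-step gap small: the discriminator only ever has to separate n-step from (n+1)-step behavior, rather than student from teacher.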

10 retrieved papers
Can Refute
First-Frame Enhancement (FFE) inference strategy

The authors present a frame-wise inference approach that assigns more intensive denoising steps to the first frame while using larger skipping steps for subsequent frames. This strategy mitigates error propagation in causal video generation while maintaining low computational cost.
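As a rough illustration of the allocation idea (the exact per-frame step counts are not specified in the claim, so the numbers below are placeholders):

```python
def ffe_step_schedule(num_frames, first_frame_steps=4, later_frame_steps=1):
    """Hypothetical First-Frame Enhancement schedule: spend more
    denoising steps on frame 0, fewer on every subsequent frame."""
    return [first_frame_steps if i == 0 else later_frame_steps
            for i in range(num_frames)]
```

For a 5-frame clip this yields `[4, 1, 1, 1, 1]`: the first frame, whose errors would otherwise propagate through the causal chain, absorbs most of the budget, while the total cost stays close to one step per frame.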

5 retrieved papers
Can Refute
Step-unified distillation framework for flexible inference

The authors develop a unified framework where one distilled model can operate at various inference step counts (e.g., 1-step, 2-step, 4-step) without requiring separate training for each configuration. This design improves practical usability by removing the need for repetitive re-distillation across different step settings.
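One common way to realize such step-count flexibility at inference time is to subsample a single trained timestep range into evenly spaced schedules of different lengths; the sketch below shows that pattern under that assumption (the paper's actual schedule construction is not described in the claim).

```python
import numpy as np

def unified_step_schedule(total_timesteps, num_steps):
    """Pick `num_steps` evenly spaced timesteps (descending) from a
    shared range, so one distilled model can run at 1, 2, or 4 steps."""
    idx = np.linspace(total_timesteps - 1, 0, num_steps)
    return [int(round(t)) for t in idx]
```

For example, with a 1000-timestep range, 1-step inference uses `[999]`, 2-step uses `[999, 0]`, and 4-step uses `[999, 666, 333, 0]`; all three schedules query the same distilled network.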

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Adversarial Self-Distillation (ASD) strategy for few-step video generation

The authors introduce a novel distillation method that uses a discriminator to align the n-step and (n+1)-step denoising distributions of a single student model. This approach provides smoother supervision by bridging smaller step-to-step gaps and enables a single model to support multiple inference-step configurations without separate re-distillation.

Contribution

First-Frame Enhancement (FFE) inference strategy

The authors present a frame-wise inference approach that assigns more intensive denoising steps to the first frame while using larger skipping steps for subsequent frames. This strategy mitigates error propagation in causal video generation while maintaining low computational cost.

Contribution

Step-unified distillation framework for flexible inference

The authors develop a unified framework where one distilled model can operate at various inference step counts (e.g., 1-step, 2-step, 4-step) without requiring separate training for each configuration. This design improves practical usability by removing the need for repetitive re-distillation across different step settings.