Towards One-step Causal Video Generation via Adversarial Self-Distillation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Diffusion Distillation, Causal Text-to-Video Generation
Abstract:

Recent hybrid video generation models combine autoregressive temporal dynamics with diffusion-based spatial denoising, but their sequential, iterative nature leads to error accumulation and long inference times. In this work, we propose a distillation-based framework for efficient causal video generation that enables high-quality synthesis with extremely limited denoising steps. Our approach builds upon the Distribution Matching Distillation (DMD) framework and introduces a novel Adversarial Self-Distillation (ASD) strategy, which aligns the outputs of the student model's n-step denoising process with its (n+1)-step version at the distribution level. This design provides smoother supervision by bridging small intra-student gaps, and more informative guidance by combining teacher knowledge with locally consistent student behavior, substantially improving training stability and generation quality in extremely few-step scenarios. In addition, we present a First-Frame Enhancement (FFE) strategy, which allocates more denoising steps to the initial frames to mitigate error propagation while applying larger skipping steps to later frames. Extensive experiments on VBench demonstrate that our method surpasses state-of-the-art approaches in both one-step and two-step video generation. Notably, our framework produces a single distilled model that flexibly supports multiple inference-step settings, eliminating the need for repeated re-distillation and enabling efficient, high-quality video synthesis.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a distillation framework for efficient causal video generation, introducing Adversarial Self-Distillation (ASD) and First-Frame Enhancement (FFE) strategies to reduce denoising steps. It resides in the 'Bidirectional-to-Causal Architecture Adaptation' leaf, which contains only three papers total, including this work and two siblings. This leaf focuses on converting pretrained bidirectional diffusion transformers into causal autoregressive generators. The small sibling count suggests a relatively sparse research direction within the broader taxonomy of nine papers across seven leaf nodes.

The taxonomy reveals neighboring approaches in 'Acceleration via Distillation and Step Reduction' (two papers) and 'Parallel and Streaming Generation Strategies' (two papers). The paper's leaf excludes methods designing autoregressive models from scratch or using masked autoregression, which belong to the 'Masked Autoregressive Planning' leaf (one paper). The work bridges its bidirectional-to-causal focus with distillation techniques from the acceleration branch, while differing from streaming methods that prioritize real-time generation over step reduction. This positioning suggests the paper combines architectural adaptation with distillation-driven speedup, straddling two related but distinct research directions.

Among 25 candidates examined, the analysis found refutable prior work for all three contributions. The ASD strategy examined 10 candidates with 1 refutable match, suggesting some overlap in self-distillation concepts. FFE examined 5 candidates with 1 refutable match, indicating prior work on adaptive step allocation. The step-unified distillation framework examined 10 candidates with 4 refutable matches, the highest overlap count, suggesting this contribution has more substantial precedent. The limited search scope (25 papers, not exhaustive) means these statistics reflect top-K semantic matches rather than comprehensive field coverage.

Based on the limited search scope, the work appears to combine established distillation ideas with architectural adaptation in a moderately explored area. The taxonomy's sparse structure (nine total papers) and the contribution-level statistics (6 total refutable pairs across 25 candidates) suggest incremental refinement rather than foundational novelty. However, the analysis covers only top-K semantic matches and does not capture potential novelty in implementation details, experimental validation, or combinations of techniques.

Taxonomy

Core-task Taxonomy Papers: 9
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 6

Research Landscape Overview

Core task: efficient causal video generation with minimal denoising steps. The field addresses the challenge of producing temporally coherent video sequences in a causally consistent manner while reducing computational overhead. The taxonomy reveals four main branches: Autoregressive Temporal Modeling with Diffusion Synthesis, which combines sequential frame generation with diffusion-based synthesis; Acceleration via Distillation and Step Reduction, focusing on compressing multi-step diffusion into fewer iterations; Parallel and Streaming Generation Strategies, exploring methods that generate multiple frames or segments concurrently; and Neural ODE-Based Temporal Flow Modeling, leveraging continuous-time dynamics for smoother temporal transitions. Works like Fast Autoregressive Video[1] and Progressive Autoregressive Video[2] exemplify autoregressive approaches, while Mardini[3] and DiT Acceleration[9] illustrate distillation-driven speedups. These branches collectively aim to balance generation quality, temporal consistency, and inference speed.

A particularly active line of work centers on adapting bidirectional architectures for causal generation, where models originally trained on full-sequence contexts are repurposed for sequential, left-to-right synthesis. Causal Video Distillation[0] sits within this cluster, emphasizing the conversion of bidirectional diffusion models into efficient causal generators through distillation techniques. This contrasts with Fast Causal Generators[6], which may prioritize architectural redesign from scratch, and Fast Autoregressive Video[1], which focuses on autoregressive refinement without necessarily distilling from a bidirectional source.

The main trade-off across these approaches involves whether to leverage pre-trained bidirectional knowledge or to train causal models de novo, and how aggressively to reduce denoising steps without sacrificing frame-to-frame coherence. Open questions remain around optimal distillation schedules and the extent to which parallel strategies can be integrated into inherently sequential causal pipelines.

Claimed Contributions

Adversarial Self-Distillation (ASD) strategy for few-step video generation

The authors introduce a novel distillation method that uses a discriminator to align the n-step and (n+1)-step denoising distributions of a single student model. This approach provides smoother supervision by bridging smaller step-to-step gaps and enables a single model to support multiple inference-step configurations without separate re-distillation.
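The paper does not publish its training code, but the distribution-level alignment described above can be illustrated with a toy sketch. The snippet below assumes a standard non-saturating GAN objective over discriminator logits; the function names, the toy 1D logit setting, and the specific loss form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def asd_losses(d_n, d_n1):
    """Toy non-saturating GAN losses for Adversarial Self-Distillation.

    d_n  : discriminator logits on the student's n-step outputs
           (treated as the "fake" side of the game)
    d_n1 : discriminator logits on the same student's (n+1)-step
           outputs (treated as the "real" side)
    """
    eps = 1e-12
    # Discriminator: score (n+1)-step outputs high, n-step outputs low.
    loss_d = (-np.mean(np.log(sigmoid(d_n1) + eps))
              - np.mean(np.log(1.0 - sigmoid(d_n) + eps)))
    # Student (generator side): make its n-step outputs
    # indistinguishable from its own (n+1)-step outputs.
    loss_g = -np.mean(np.log(sigmoid(d_n) + eps))
    return loss_d, loss_g
```

In this framing the "real" distribution moves with the student itself, which is what makes the step-to-step gap small: the discriminator only ever has to separate n-step from (n+1)-step behavior, rather than student from teacher.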

10 retrieved papers
Can Refute
First-Frame Enhancement (FFE) inference strategy

The authors present a frame-wise inference approach that assigns more intensive denoising steps to the first frame while using larger skipping steps for subsequent frames. This strategy mitigates error propagation in causal video generation while maintaining low computational cost.
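As a rough illustration of the allocation idea (the exact per-frame step counts are not specified in the claim, so the numbers below are placeholders):

```python
def ffe_step_schedule(num_frames, first_frame_steps=4, later_frame_steps=1):
    """Hypothetical First-Frame Enhancement schedule: spend more
    denoising steps on frame 0, fewer on every subsequent frame."""
    return [first_frame_steps if i == 0 else later_frame_steps
            for i in range(num_frames)]
```

For a 5-frame clip this yields `[4, 1, 1, 1, 1]`: the first frame, whose errors would otherwise propagate through the causal chain, absorbs most of the budget, while the total cost stays close to one step per frame.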

5 retrieved papers
Can Refute
Step-unified distillation framework for flexible inference

The authors develop a unified framework where one distilled model can operate at various inference step counts (e.g., 1-step, 2-step, 4-step) without requiring separate training for each configuration. This design improves practical usability by removing the need for repetitive re-distillation across different step settings.
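One common way to realize such step-count flexibility at inference time is to subsample a single trained timestep range into evenly spaced schedules of different lengths; the sketch below shows that pattern under that assumption (the paper's actual schedule construction is not described in the claim).

```python
import numpy as np

def unified_step_schedule(total_timesteps, num_steps):
    """Pick `num_steps` evenly spaced timesteps (descending) from a
    shared range, so one distilled model can run at 1, 2, or 4 steps."""
    idx = np.linspace(total_timesteps - 1, 0, num_steps)
    return [int(round(t)) for t in idx]
```

For example, with a 1000-timestep range, 1-step inference uses `[999]`, 2-step uses `[999, 0]`, and 4-step uses `[999, 666, 333, 0]`; all three schedules query the same distilled network.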

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Adversarial Self-Distillation (ASD) strategy for few-step video generation

The authors introduce a novel distillation method that uses a discriminator to align the n-step and (n+1)-step denoising distributions of a single student model. This approach provides smoother supervision by bridging smaller step-to-step gaps and enables a single model to support multiple inference-step configurations without separate re-distillation.

Contribution

First-Frame Enhancement (FFE) inference strategy

The authors present a frame-wise inference approach that assigns more intensive denoising steps to the first frame while using larger skipping steps for subsequent frames. This strategy mitigates error propagation in causal video generation while maintaining low computational cost.

Contribution

Step-unified distillation framework for flexible inference

The authors develop a unified framework where one distilled model can operate at various inference step counts (e.g., 1-step, 2-step, 4-step) without requiring separate training for each configuration. This design improves practical usability by removing the need for repetitive re-distillation across different step settings.