Greedy Distill: Efficient Video Generative Modeling with Linear Time Complexity

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Video Generation, Diffusion-based Models, Model Distillation
Abstract:

Due to bidirectional attention dependencies, video generation models generally suffer from O(n^2) computational complexity. In this work, we identify the “local inter-frame information redundancy” phenomenon, which indicates strong local temporal dependencies in video generation, with global attention to distant frames contributing only marginally. Building on this finding, we introduce a novel distillation training paradigm for video diffusion models, namely GREEDY DISTILL. Specifically, to generate the next frame using only the 0-th and the last frames, we propose the Streaming Diffusion Decoder (SDD) as the “Greedy Decoder” to avoid redundant computation over the other frames. Meanwhile, we introduce the Efficient Temporal Module (ETM) to capture global temporal information across frames. Together, these two modules reduce the computational complexity from O(n^2) to linear. Moreover, we make the first attempt to apply RL fine-tuning to address error accumulation during streaming generation. Our method achieves an overall score of 84.60 on the VBench benchmark, surpassing previous state-of-the-art methods by a large margin (+4.18%). Qualitative results also demonstrate superior performance. Leveraging its efficient model structure and KV cache, our method rapidly generates high-quality video streams at 24 FPS (nearly 50% faster) on a single H100 GPU.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a distillation paradigm for video diffusion models that reduces computational complexity from quadratic to linear by leveraging local temporal redundancy. It proposes the Streaming Diffusion Decoder (SDD) and Efficient Temporal Module (ETM) to generate frames using only the initial and last frames, avoiding redundant computation. Within the taxonomy, it resides in the 'Distillation and Few-Step Generation' leaf under 'Inference-Time Optimization and Acceleration', sharing this category with only one sibling paper. This indicates a relatively sparse research direction focused specifically on distillation-based inference acceleration, though the broader parent branch of inference-time optimization is more populated.

The paper's position connects to several neighboring research directions. The 'Test-Time Scaling and Optimization' leaf explores iterative refinement strategies that contrast with the few-step approach here. The 'Latent Space Compression' branch (including Latent Video Diffusion and MAGVIT) addresses efficiency through representation compression rather than inference distillation, representing a complementary strategy. The 'Linear Attention and Block-Based Architectures' subtopic under 'Computational Complexity Reduction via Attention Mechanisms' tackles quadratic complexity through architectural modifications rather than distillation. The taxonomy's scope notes clarify that this work belongs in inference optimization rather than training-time or architectural categories, distinguishing it from parameter-efficient adaptation methods.

Among the three contributions analyzed, the core distillation paradigm examined three candidates with none providing clear refutation, suggesting novelty in the specific greedy frame selection strategy. The SDD and ETM modules examined ten candidates without refutation, indicating these architectural components may represent new designs within the limited search scope. The reinforcement learning fine-tuning contribution examined nine candidates and found one potentially refutable prior work, suggesting this aspect has more substantial overlap with existing techniques. The analysis covered twenty-two total candidates from semantic search, providing a focused but not exhaustive view of the literature landscape.

Based on the limited search scope of twenty-two candidates, the work appears to introduce novel mechanisms for streaming video generation through distillation, particularly in the greedy decoder design. The reinforcement learning component shows more connection to prior work, which is expected given RL's established use in addressing sequential generation errors. The taxonomy structure reveals this sits in a less crowded inference-optimization niche compared to broader latent compression or attention mechanism research directions, though definitive novelty assessment would require examining additional candidates beyond the top-K semantic matches analyzed here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 1

Research Landscape Overview

Core task: efficient video generation with reduced computational complexity. The field has evolved into a rich ecosystem of approaches that tackle efficiency from multiple angles. At the highest level, the taxonomy reveals branches focused on attention mechanisms (reducing quadratic complexity in transformers), latent space compression (operating in lower-dimensional representations as in Latent Video Diffusion[11] and MAGVIT[20]), efficient training and adaptation (such as parameter-efficient fine-tuning in Tune A Video[24]), autoregressive and non-diffusion methods (e.g., Phenaki[26] and StyleGAN V[17]), inference-time optimization (distillation and few-step generation), hierarchical strategies (coarse-to-fine refinement in Pyramidal Flow[37]), domain-specific solutions (like DrivingGen[41] for autonomous driving), training without paired text-video data, scalable architectures for high-resolution or long videos (Open Sora[14] and Open Sora Plan[29]), and multiview coding efficiency.

These branches are not isolated: latent compression often combines with attention optimizations, while distillation techniques can accelerate both diffusion and autoregressive models. Particularly active lines of work include inference-time acceleration through distillation and few-step generation, where the trade-off between sample quality and speed is central. Greedy Distill[0] sits squarely in this branch, aiming to compress multi-step diffusion into fewer iterations while preserving fidelity. Nearby works such as OSV[23] and AnimateLCM[32] similarly pursue few-step inference, though they may differ in distillation strategy or target architecture. Another vibrant area is latent space efficiency, where methods like Align Latents[2] and Sana Video[3] explore compact representations and efficient encoders.

Greedy Distill[0] contrasts with these by focusing on the inference pipeline rather than the latent encoding itself, yet both directions share the goal of reducing computational overhead. Open questions persist around the scalability of distilled models to very long videos, the generalization of few-step methods across diverse content, and the interplay between training-time compression and inference-time acceleration.

Claimed Contributions

Greedy Distill distillation paradigm with linear time complexity

The authors propose a new asymmetric distillation framework that reduces computational complexity from O(n²) to linear by distilling a bidirectional video diffusion teacher model into a student model with substantially different architecture. This paradigm enables efficient video generation while maintaining quality comparable to the teacher model.

3 retrieved papers
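The core of any such output-matching distillation can be illustrated with a toy sketch: a "student" is fit by gradient descent to reproduce a frozen "teacher" on the same inputs. The linear stand-ins, variable names, and loss below are assumptions for illustration only; the paper's actual teacher and student are diffusion models with substantially different architectures.

```python
# Toy output-matching distillation: fit a student to a frozen teacher.
# Linear maps stand in for the real (diffusion) teacher/student models.
import numpy as np

rng = np.random.default_rng(0)
W_teacher = rng.normal(size=(8, 8))     # frozen "teacher" weights
W_student = np.zeros((8, 8))            # "student" starts from scratch
frames = rng.normal(size=(16, 8))       # a batch of 16 frame latents

target = frames @ W_teacher             # teacher outputs (no gradient)
init_loss = float(((frames @ W_student - target) ** 2).mean())
for _ in range(200):                    # plain gradient descent on MSE
    pred = frames @ W_student
    grad = 2.0 * frames.T @ (pred - target) / len(frames)
    W_student -= 0.05 * grad
final_loss = float(((frames @ W_student - target) ** 2).mean())
print(init_loss, final_loss)
```

The distillation loss drives the student's outputs toward the teacher's; in the actual paradigm the student additionally changes architecture (bidirectional to streaming), which a linear toy cannot capture.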
Streaming Diffusion Decoder (SDD) and Efficient Temporal Module (ETM)

The authors introduce two novel architectural components: SDD generates the next frame using only the 0-th and last frames in a streaming manner, while ETM employs chunk-wise sliding window attention to capture both local and global temporal dependencies. Together, these modules achieve linear computational complexity.

10 retrieved papers
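The complexity argument behind a sliding-window attention like ETM's can be made concrete with a minimal sketch: when each frame attends only to a fixed-size window of recent frames, total work is O(n * window) rather than O(n^2). This is an illustrative reconstruction, not the paper's ETM code; the `window` parameter and single-head, per-frame formulation are assumptions.

```python
# Minimal sliding-window attention over per-frame features.
# Each query attends to at most `window` keys, so cost is linear in n.
import numpy as np

def sliding_window_attention(x, window=4):
    """x: (n_frames, d) features; frame i attends to frames
    [i-window+1, i] only (causal, fixed-size window)."""
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):                      # n iterations...
        lo = max(0, i - window + 1)
        k = x[lo:i + 1]                     # ...each over <= window keys
        scores = k @ x[i] / np.sqrt(d)      # (m,) attention logits
        w = np.exp(scores - scores.max())   # stable softmax
        w /= w.sum()
        out[i] = w @ k                      # weighted sum of window frames
    return out

x = np.random.default_rng(0).normal(size=(16, 8))
y = sliding_window_attention(x)
print(y.shape)
```

A chunk-wise variant (as ETM is described) processes frames in blocks rather than one at a time, but the linear-in-n accounting is the same: each position touches a bounded number of others.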
Reinforcement learning fine-tuning for error accumulation mitigation

The authors present the first application of reinforcement learning fine-tuning to mitigate exposure bias and error accumulation in streaming video generation. The RL approach directly optimizes the model's own predictions throughout the generation process, reducing reliance on ground-truth context during inference.

9 retrieved papers (1 can refute)
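The underlying idea of fine-tuning on the model's own rollouts can be sketched with a minimal REINFORCE loop: sample from the current policy, score the sample, and push probability mass toward high-reward outputs. The two-action toy policy, reward vector, and learning rate below are stand-ins for illustration, not the paper's actual reward model or video generator.

```python
# Toy REINFORCE: the policy is trained on its own samples (rollouts),
# mirroring how RL fine-tuning avoids relying on ground-truth context.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)                    # toy policy over two "continuations"
reward = np.array([0.0, 1.0])           # continuation 1 is the good one

for _ in range(500):
    p = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(2, p=p)              # rollout from the model's own policy
    baseline = p @ reward               # variance-reducing baseline
    # REINFORCE gradient of log softmax prob, scaled by the advantage:
    grad = (reward[a] - baseline) * ((np.arange(2) == a) - p)
    logits += 0.1 * grad                # ascend the policy-gradient estimate

p = np.exp(logits) / np.exp(logits).sum()
print(p)
```

Because training samples come from the policy itself rather than from teacher-forced ground truth, the same mechanism is what lets RL fine-tuning target exposure bias in sequential generation.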

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Greedy Distill distillation paradigm with linear time complexity

The authors propose a new asymmetric distillation framework that reduces computational complexity from O(n²) to linear by distilling a bidirectional video diffusion teacher model into a student model with substantially different architecture. This paradigm enables efficient video generation while maintaining quality comparable to the teacher model.

Contribution

Streaming Diffusion Decoder (SDD) and Efficient Temporal Module (ETM)

The authors introduce two novel architectural components: SDD generates the next frame using only the 0-th and last frames in a streaming manner, while ETM employs chunk-wise sliding window attention to capture both local and global temporal dependencies. Together, these modules achieve linear computational complexity.

Contribution

Reinforcement learning fine-tuning for error accumulation mitigation

The authors present the first application of reinforcement learning fine-tuning to mitigate exposure bias and error accumulation in streaming video generation. The RL approach directly optimizes the model's own predictions throughout the generation process, reducing reliance on ground-truth context during inference.