Greedy Distill: Efficient Video Generative Modeling with Linear Time Complexity

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Video Generation, Diffusion-based Models, Model Distillation
Abstract:

Due to bidirectional attention dependencies, video generation models generally suffer from O(n^2) computational complexity. In this work, we identify the “local inter-frame information redundancy” phenomenon, which indicates strong local temporal dependencies in video generation, with global attention to distant frames contributing only marginally. Building on this finding, we introduce a novel distillation training paradigm for video diffusion models, namely GREEDY DISTILL. Specifically, to generate the next frame using only the 0-th and the last frames, we propose the Streaming Diffusion Decoder (SDD) as the “Greedy Decoder” to avoid redundant computation over the other frames. Meanwhile, we introduce the Efficient Temporal Module (ETM) to capture global temporal information across frames. Together, these two modules reduce the computational complexity from O(n^2) to linear. Moreover, we make the first attempt to apply RL fine-tuning to address error accumulation during streaming generation. Our method achieves an overall score of 84.60 on the VBench benchmark, surpassing previous state-of-the-art methods by a large margin (+4.18%). Qualitative results also demonstrate superior performance. Leveraging its efficient model structure and KV cache, our method rapidly generates high-quality video streams at 24 FPS (nearly 50% faster) on a single H100 GPU.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a distillation paradigm for video diffusion models that reduces computational complexity from quadratic to linear by leveraging local temporal redundancy. It proposes the Streaming Diffusion Decoder (SDD) and Efficient Temporal Module (ETM) to generate frames using only the initial and last frames, avoiding redundant computation. Within the taxonomy, it resides in the 'Distillation and Few-Step Generation' leaf under 'Inference-Time Optimization and Acceleration', sharing this category with only one sibling paper. This indicates a relatively sparse research direction focused specifically on distillation-based inference acceleration, though the broader parent branch of inference-time optimization is more populated.

The paper's position connects to several neighboring research directions. The 'Test-Time Scaling and Optimization' leaf explores iterative refinement strategies that contrast with the few-step approach here. The 'Latent Space Compression' branch (including Latent Video Diffusion and MAGVIT) addresses efficiency through representation compression rather than inference distillation, representing a complementary strategy. The 'Linear Attention and Block-Based Architectures' subtopic under 'Computational Complexity Reduction via Attention Mechanisms' tackles quadratic complexity through architectural modifications rather than distillation. The taxonomy's scope notes clarify that this work belongs in inference optimization rather than training-time or architectural categories, distinguishing it from parameter-efficient adaptation methods.

Among the three contributions analyzed, the core distillation paradigm examined three candidates with none providing clear refutation, suggesting novelty in the specific greedy frame selection strategy. The SDD and ETM modules examined ten candidates without refutation, indicating these architectural components may represent new designs within the limited search scope. The reinforcement learning fine-tuning contribution examined nine candidates and found one potentially refutable prior work, suggesting this aspect has more substantial overlap with existing techniques. The analysis covered twenty-two total candidates from semantic search, providing a focused but not exhaustive view of the literature landscape.

Based on the limited search scope of twenty-two candidates, the work appears to introduce novel mechanisms for streaming video generation through distillation, particularly in the greedy decoder design. The reinforcement learning component shows more connection to prior work, which is expected given RL's established use in addressing sequential generation errors. The taxonomy structure reveals this sits in a less crowded inference-optimization niche compared to broader latent compression or attention mechanism research directions, though definitive novelty assessment would require examining additional candidates beyond the top-K semantic matches analyzed here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 1

Research Landscape Overview

Core task: efficient video generation with reduced computational complexity. The field has evolved into a rich ecosystem of approaches that tackle efficiency from multiple angles. At the highest level, the taxonomy reveals branches focused on attention mechanisms (reducing quadratic complexity in transformers), latent space compression (operating in lower-dimensional representations as in Latent Video Diffusion[11] and MAGVIT[20]), efficient training and adaptation (such as parameter-efficient fine-tuning in Tune A Video[24]), autoregressive and non-diffusion methods (e.g., Phenaki[26] and StyleGAN V[17]), inference-time optimization (distillation and few-step generation), hierarchical strategies (coarse-to-fine refinement in Pyramidal Flow[37]), domain-specific solutions (like DrivingGen[41] for autonomous driving), training without paired text-video data, scalable architectures for high-resolution or long videos (Open Sora[14] and Open Sora Plan[29]), and multiview coding efficiency.

These branches are not isolated: latent compression often combines with attention optimizations, while distillation techniques can accelerate both diffusion and autoregressive models. Particularly active lines of work include inference-time acceleration through distillation and few-step generation, where the trade-off between sample quality and speed is central. Greedy Distill[0] sits squarely in this branch, aiming to compress multi-step diffusion into fewer iterations while preserving fidelity. Nearby works such as OSV[23] and AnimateLCM[32] similarly pursue few-step inference, though they may differ in distillation strategy or target architecture. Another vibrant area is latent space efficiency, where methods like Align Latents[2] and Sana Video[3] explore compact representations and efficient encoders.

Greedy Distill[0] contrasts with these by focusing on the inference pipeline rather than the latent encoding itself, yet both directions share the goal of reducing computational overhead. Open questions persist around the scalability of distilled models to very long videos, the generalization of few-step methods across diverse content, and the interplay between training-time compression and inference-time acceleration.

Claimed Contributions

Greedy Distill distillation paradigm with linear time complexity

The authors propose a new asymmetric distillation framework that reduces computational complexity from O(n²) to linear by distilling a bidirectional video diffusion teacher model into a student model with substantially different architecture. This paradigm enables efficient video generation while maintaining quality comparable to the teacher model.

3 retrieved papers
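The core of any such output-matching distillation can be illustrated with a toy sketch: a "student" is fit by gradient descent to reproduce a frozen "teacher" on the same inputs. The linear stand-ins, variable names, and loss below are assumptions for illustration only; the paper's actual teacher and student are diffusion models with substantially different architectures.

```python
# Toy output-matching distillation: fit a student to a frozen teacher.
# Linear maps stand in for the real (diffusion) teacher/student models.
import numpy as np

rng = np.random.default_rng(0)
W_teacher = rng.normal(size=(8, 8))     # frozen "teacher" weights
W_student = np.zeros((8, 8))            # "student" starts from scratch
frames = rng.normal(size=(16, 8))       # a batch of 16 frame latents

target = frames @ W_teacher             # teacher outputs (no gradient)
init_loss = float(((frames @ W_student - target) ** 2).mean())
for _ in range(200):                    # plain gradient descent on MSE
    pred = frames @ W_student
    grad = 2.0 * frames.T @ (pred - target) / len(frames)
    W_student -= 0.05 * grad
final_loss = float(((frames @ W_student - target) ** 2).mean())
print(init_loss, final_loss)
```

The distillation loss drives the student's outputs toward the teacher's; in the actual paradigm the student additionally changes architecture (bidirectional to streaming), which a linear toy cannot capture.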
Streaming Diffusion Decoder (SDD) and Efficient Temporal Module (ETM)

The authors introduce two novel architectural components: SDD generates the next frame using only the 0-th and last frames in a streaming manner, while ETM employs chunk-wise sliding window attention to capture both local and global temporal dependencies. Together, these modules achieve linear computational complexity.

10 retrieved papers
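The complexity argument behind a sliding-window attention like ETM's can be made concrete with a minimal sketch: when each frame attends only to a fixed-size window of recent frames, total work is O(n * window) rather than O(n^2). This is an illustrative reconstruction, not the paper's ETM code; the `window` parameter and single-head, per-frame formulation are assumptions.

```python
# Minimal sliding-window attention over per-frame features.
# Each query attends to at most `window` keys, so cost is linear in n.
import numpy as np

def sliding_window_attention(x, window=4):
    """x: (n_frames, d) features; frame i attends to frames
    [i-window+1, i] only (causal, fixed-size window)."""
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):                      # n iterations...
        lo = max(0, i - window + 1)
        k = x[lo:i + 1]                     # ...each over <= window keys
        scores = k @ x[i] / np.sqrt(d)      # (m,) attention logits
        w = np.exp(scores - scores.max())   # stable softmax
        w /= w.sum()
        out[i] = w @ k                      # weighted sum of window frames
    return out

x = np.random.default_rng(0).normal(size=(16, 8))
y = sliding_window_attention(x)
print(y.shape)
```

A chunk-wise variant (as ETM is described) processes frames in blocks rather than one at a time, but the linear-in-n accounting is the same: each position touches a bounded number of others.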
Reinforcement learning fine-tuning for error accumulation mitigation

The authors present the first application of reinforcement learning fine-tuning to mitigate exposure bias and error accumulation in streaming video generation. The RL approach directly optimizes the model's own predictions throughout the generation process, reducing reliance on ground-truth context during inference.

9 retrieved papers (1 can refute)
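The underlying idea of fine-tuning on the model's own rollouts can be sketched with a minimal REINFORCE loop: sample from the current policy, score the sample, and push probability mass toward high-reward outputs. The two-action toy policy, reward vector, and learning rate below are stand-ins for illustration, not the paper's actual reward model or video generator.

```python
# Toy REINFORCE: the policy is trained on its own samples (rollouts),
# mirroring how RL fine-tuning avoids relying on ground-truth context.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)                    # toy policy over two "continuations"
reward = np.array([0.0, 1.0])           # continuation 1 is the good one

for _ in range(500):
    p = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(2, p=p)              # rollout from the model's own policy
    baseline = p @ reward               # variance-reducing baseline
    # REINFORCE gradient of log softmax prob, scaled by the advantage:
    grad = (reward[a] - baseline) * ((np.arange(2) == a) - p)
    logits += 0.1 * grad                # ascend the policy-gradient estimate

p = np.exp(logits) / np.exp(logits).sum()
print(p)
```

Because training samples come from the policy itself rather than from teacher-forced ground truth, the same mechanism is what lets RL fine-tuning target exposure bias in sequential generation.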

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Greedy Distill distillation paradigm with linear time complexity

The authors propose a new asymmetric distillation framework that reduces computational complexity from O(n²) to linear by distilling a bidirectional video diffusion teacher model into a student model with substantially different architecture. This paradigm enables efficient video generation while maintaining quality comparable to the teacher model.

Contribution

Streaming Diffusion Decoder (SDD) and Efficient Temporal Module (ETM)

The authors introduce two novel architectural components: SDD generates the next frame using only the 0-th and last frames in a streaming manner, while ETM employs chunk-wise sliding window attention to capture both local and global temporal dependencies. Together, these modules achieve linear computational complexity.

Contribution

Reinforcement learning fine-tuning for error accumulation mitigation

The authors present the first application of reinforcement learning fine-tuning to mitigate exposure bias and error accumulation in streaming video generation. The RL approach directly optimizes the model's own predictions throughout the generation process, reducing reliance on ground-truth context during inference.