QVGen: Pushing the Limit of Quantized Video Generative Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: quantization-aware training, video diffusion models
Abstract:

Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has achieved notable success in reducing cost for image DMs, whereas its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules (Φ) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of Φ, we propose a rank-decay strategy that progressively eliminates Φ. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization γ to identify and decay low-contributing components. This strategy retains performance while zeroing out additional inference overhead. Extensive experiments across 4 state-of-the-art (SOTA) video DMs, with parameter sizes ranging from 1.3B to 14B, show that QVGen is the first to reach full-precision comparable quality under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of +25.28 in Dynamic Degree and +8.43 in Scene Consistency on VBench. Code and videos are available in the supplementary material.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

38 Core-task Taxonomy Papers
3 Claimed Contributions
24 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: quantization-aware training for video diffusion models. The field has organized itself around several complementary perspectives on reducing the computational and memory footprint of diffusion-based video generation. At the highest level, Quantization Strategy and Optimization explores fundamental training regimes—including quantization-aware training (QAT), post-training quantization (PTQ), and mixed-precision schemes—that directly address how to learn or calibrate low-bit representations. Feature-Aware Quantization focuses on exploiting structural properties of activations, weights, or temporal dynamics to assign bits more intelligently across layers or time steps. Joint Optimization with Complementary Techniques investigates hybrid approaches that combine quantization with pruning, knowledge distillation, or low-rank decomposition to achieve greater compression. Deployment-Oriented Quantization emphasizes hardware constraints and real-world inference scenarios, while Theoretical Foundations and Comprehensive Surveys provide broader context on convergence guarantees and design principles. Finally, Application-Specific Quantization tailors methods to particular domains such as sign language or medical imaging, where domain priors can guide bit allocation.
Within the Quantization Strategy and Optimization branch, a dense cluster of works explores QAT variants that retrain or fine-tune diffusion models end-to-end with quantized operations. QVGen[0] exemplifies this direction by integrating quantization directly into the video diffusion training loop, aiming to preserve generation quality under aggressive bit-width reduction. Nearby efforts such as FraQAT[23] and DilateQuant[31] similarly adopt QAT but introduce specialized techniques—fractional bit allocations or dilated convolution-aware quantizers—to handle the unique temporal coherence demands of video.
In contrast, methods like TCAQ[2] and Time-Rotation Diffusion Quantization[1] emphasize calibration strategies that adapt quantization parameters across diffusion timesteps or rotational embeddings, blurring the line between pure QAT and hybrid calibration. The central trade-off across these lines is whether to invest training compute for tighter integration (as QVGen[0] does) or to rely on lighter post-hoc adjustments that may sacrifice some quality but reduce retraining overhead.

Claimed Contributions

QVGen: A novel QAT framework for video diffusion models

The authors present QVGen, the first quantization-aware training framework specifically designed for video diffusion models. It enables effective 3-bit and 4-bit quantization while achieving full-precision comparable quality, addressing the challenge that existing QAT methods fail to handle video generation tasks under extremely low-bit settings.

9 retrieved papers
Auxiliary modules (Φ) to reduce gradient norm and improve convergence

The authors introduce learnable auxiliary modules that mitigate quantization errors during training. Through theoretical analysis demonstrating that reducing gradient norm is essential for QAT convergence, these modules stabilize the training process and significantly enhance convergence for extremely low-bit quantization.

5 retrieved papers
Can Refute
Rank-decay strategy to eliminate inference overhead

The authors develop a rank-decay strategy that progressively removes auxiliary modules during training to eliminate inference overhead. This strategy repeatedly applies singular value decomposition and rank-based regularization to identify and decay low-contributing components, ultimately achieving zero additional inference cost while maintaining performance.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

QVGen: A novel QAT framework for video diffusion models

The authors present QVGen, the first quantization-aware training framework specifically designed for video diffusion models. It enables effective 3-bit and 4-bit quantization while achieving full-precision comparable quality, addressing the challenge that existing QAT methods fail to handle video generation tasks under extremely low-bit settings.
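To make the low-bit setting concrete, the sketch below shows symmetric uniform "fake" quantization to 4 bits, the standard building block of QAT pipelines. This is a generic illustration, not QVGen's actual quantizer: the function name, per-tensor scaling, and signed-grid choice are assumptions; in real QAT the rounding step is made trainable via a straight-through estimator.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric uniform 'fake' quantization: snap weights to a low-bit
    integer grid, then map back to floats. In QAT the non-differentiable
    round() is usually bypassed with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for signed 4-bit
    scale = np.abs(w).max() / qmax        # per-tensor scale (an assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
w_q = fake_quantize(w, bits=4)
print(len(np.unique(w_q)))  # at most 2**4 = 16 distinct levels
```

At 3–4 bits the grid is so coarse (8–16 levels per tensor) that the resulting error dominates training dynamics, which is why the contributions below target convergence directly.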

Contribution

Auxiliary modules (Φ) to reduce gradient norm and improve convergence

The authors introduce learnable auxiliary modules that mitigate quantization errors during training. Through theoretical analysis demonstrating that reducing gradient norm is essential for QAT convergence, these modules stabilize the training process and significantly enhance convergence for extremely low-bit quantization.
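One plausible reading of the auxiliary modules Φ is a low-rank branch that absorbs the quantization error E = W − Q(W). The sketch below illustrates that idea with a truncated SVD of E; the helper names (`quantize`, `low_rank_correction`) and the rank-8 choice are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric uniform quantization (toy stand-in for the real quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def low_rank_correction(w, w_q, rank=8):
    """Build a rank-`rank` auxiliary term from the quantization error
    E = W - Q(W): the truncated SVD keeps E's largest components, so
    Q(W) + A @ B approximates W better than Q(W) alone."""
    e = w - w_q
    u, s, vt = np.linalg.svd(e, full_matrices=False)
    a = u[:, :rank] * s[:rank]            # shape (m, rank)
    b = vt[:rank]                         # shape (rank, n)
    return a, b

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64))
w_q = quantize(w)
a, b = low_rank_correction(w, w_q, rank=8)
err_before = np.linalg.norm(w - w_q)
err_after = np.linalg.norm(w - (w_q + a @ b))
print(err_after < err_before)  # True: the correction shrinks the error
```

A smaller effective weight error translates into smaller gradient mismatch between the quantized and full-precision models, consistent with the paper's gradient-norm argument for convergence.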

Contribution

Rank-decay strategy to eliminate inference overhead

The authors develop a rank-decay strategy that progressively removes auxiliary modules during training to eliminate inference overhead. This strategy repeatedly applies singular value decomposition and rank-based regularization to identify and decay low-contributing components, ultimately achieving zero additional inference cost while maintaining performance.
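The mechanics of rank decay can be sketched as repeated SVD truncation: each round zeroes the smallest (low-contributing) singular values of the auxiliary matrix until its rank reaches zero, at which point Φ can be dropped from inference entirely. This is a toy version under stated assumptions: the fixed schedule `(16, 8, 4, 0)` stands in for the paper's rank-based regularization γ, which selects components adaptively rather than by a preset count.

```python
import numpy as np

def truncate_rank(phi: np.ndarray, keep: int) -> np.ndarray:
    """One rank-decay round: SVD the auxiliary matrix, zero all but the
    `keep` largest singular values, and rebuild it."""
    u, s, vt = np.linalg.svd(phi, full_matrices=False)
    s[keep:] = 0.0                        # decay low-contributing components
    return (u * s) @ vt

rng = np.random.default_rng(2)
phi = rng.standard_normal((32, 32))       # toy auxiliary module
ranks = []
for keep in (16, 8, 4, 0):                # decay schedule toward rank zero
    phi = truncate_rank(phi, keep)
    ranks.append(int(np.linalg.matrix_rank(phi)))
print(ranks)  # [16, 8, 4, 0]
```

Because the rank reaches exactly zero, the auxiliary branch contributes nothing at inference time, matching the claim of zero additional inference cost.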