QVGen: Pushing the Limit of Quantized Video Generative Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: quantization-aware training, video diffusion models
Abstract:

Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has achieved notable success in reducing cost for image DMs, whereas its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules (Φ) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of Φ, we propose a rank-decay strategy that progressively eliminates Φ. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization γ to identify and decay low-contributing components. This strategy retains performance while zeroing out additional inference overhead. Extensive experiments across 4 state-of-the-art (SOTA) video DMs, with parameter sizes ranging from 1.3B to 14B, show that QVGen is the first to reach full-precision comparable quality under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of +25.28 in Dynamic Degree and +8.43 in Scene Consistency on VBench. Code and videos are available in the supplementary material.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

38 Core-task Taxonomy Papers
3 Claimed Contributions
24 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: quantization-aware training for video diffusion models. The field has organized itself around several complementary perspectives on reducing the computational and memory footprint of diffusion-based video generation. At the highest level, Quantization Strategy and Optimization explores fundamental training regimes—including quantization-aware training (QAT), post-training quantization (PTQ), and mixed-precision schemes—that directly address how to learn or calibrate low-bit representations. Feature-Aware Quantization focuses on exploiting structural properties of activations, weights, or temporal dynamics to assign bits more intelligently across layers or time steps. Joint Optimization with Complementary Techniques investigates hybrid approaches that combine quantization with pruning, knowledge distillation, or low-rank decomposition to achieve greater compression. Deployment-Oriented Quantization emphasizes hardware constraints and real-world inference scenarios, while Theoretical Foundations and Comprehensive Surveys provide broader context on convergence guarantees and design principles. Finally, Application-Specific Quantization tailors methods to particular domains such as sign language or medical imaging, where domain priors can guide bit allocation.
Within the Quantization Strategy and Optimization branch, a dense cluster of works explores QAT variants that retrain or fine-tune diffusion models end-to-end with quantized operations. QVGen[0] exemplifies this direction by integrating quantization directly into the video diffusion training loop, aiming to preserve generation quality under aggressive bit-width reduction. Nearby efforts such as FraQAT[23] and DilateQuant[31] similarly adopt QAT but introduce specialized techniques—fractional bit allocations or dilated convolution-aware quantizers—to handle the unique temporal coherence demands of video.
In contrast, methods like TCAQ[2] and Time-Rotation Diffusion Quantization[1] emphasize calibration strategies that adapt quantization parameters across diffusion timesteps or rotational embeddings, blurring the line between pure QAT and hybrid calibration. The central trade-off across these lines is whether to invest training compute for tighter integration (as QVGen[0] does) or to rely on lighter post-hoc adjustments that may sacrifice some quality but reduce retraining overhead.

Claimed Contributions

QVGen: A novel QAT framework for video diffusion models

The authors present QVGen, the first quantization-aware training framework specifically designed for video diffusion models. It enables effective 3-bit and 4-bit quantization while achieving full-precision comparable quality, addressing the challenge that existing QAT methods fail to handle video generation tasks under extremely low-bit settings.

9 retrieved papers
Auxiliary modules (Φ) to reduce gradient norm and improve convergence

The authors introduce learnable auxiliary modules that mitigate quantization errors during training. Through theoretical analysis demonstrating that reducing gradient norm is essential for QAT convergence, these modules stabilize the training process and significantly enhance convergence for extremely low-bit quantization.

5 retrieved papers
Can Refute
Rank-decay strategy to eliminate inference overhead

The authors develop a rank-decay strategy that progressively removes auxiliary modules during training to eliminate inference overhead. This strategy repeatedly applies singular value decomposition and rank-based regularization to identify and decay low-contributing components, ultimately achieving zero additional inference cost while maintaining performance.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

QVGen: A novel QAT framework for video diffusion models

The authors present QVGen, the first quantization-aware training framework specifically designed for video diffusion models. It enables effective 3-bit and 4-bit quantization while achieving full-precision comparable quality, addressing the challenge that existing QAT methods fail to handle video generation tasks under extremely low-bit settings.
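To make the low-bit setting concrete, the sketch below shows symmetric uniform "fake" quantization to 4 bits, the standard building block of QAT pipelines. This is a generic illustration, not QVGen's actual quantizer: the function name, per-tensor scaling, and signed-grid choice are assumptions; in real QAT the rounding step is made trainable via a straight-through estimator.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric uniform 'fake' quantization: snap weights to a low-bit
    integer grid, then map back to floats. In QAT the non-differentiable
    round() is usually bypassed with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for signed 4-bit
    scale = np.abs(w).max() / qmax        # per-tensor scale (an assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
w_q = fake_quantize(w, bits=4)
print(len(np.unique(w_q)))  # at most 2**4 = 16 distinct levels
```

At 3–4 bits the grid is so coarse (8–16 levels per tensor) that the resulting error dominates training dynamics, which is why the contributions below target convergence directly.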

Contribution

Auxiliary modules (Φ) to reduce gradient norm and improve convergence

The authors introduce learnable auxiliary modules that mitigate quantization errors during training. Through theoretical analysis demonstrating that reducing gradient norm is essential for QAT convergence, these modules stabilize the training process and significantly enhance convergence for extremely low-bit quantization.
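One plausible reading of the auxiliary modules Φ is a low-rank branch that absorbs the quantization error E = W − Q(W). The sketch below illustrates that idea with a truncated SVD of E; the helper names (`quantize`, `low_rank_correction`) and the rank-8 choice are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric uniform quantization (toy stand-in for the real quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def low_rank_correction(w, w_q, rank=8):
    """Build a rank-`rank` auxiliary term from the quantization error
    E = W - Q(W): the truncated SVD keeps E's largest components, so
    Q(W) + A @ B approximates W better than Q(W) alone."""
    e = w - w_q
    u, s, vt = np.linalg.svd(e, full_matrices=False)
    a = u[:, :rank] * s[:rank]            # shape (m, rank)
    b = vt[:rank]                         # shape (rank, n)
    return a, b

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64))
w_q = quantize(w)
a, b = low_rank_correction(w, w_q, rank=8)
err_before = np.linalg.norm(w - w_q)
err_after = np.linalg.norm(w - (w_q + a @ b))
print(err_after < err_before)  # True: the correction shrinks the error
```

A smaller effective weight error translates into smaller gradient mismatch between the quantized and full-precision models, consistent with the paper's gradient-norm argument for convergence.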

Contribution

Rank-decay strategy to eliminate inference overhead

The authors develop a rank-decay strategy that progressively removes auxiliary modules during training to eliminate inference overhead. This strategy repeatedly applies singular value decomposition and rank-based regularization to identify and decay low-contributing components, ultimately achieving zero additional inference cost while maintaining performance.
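The mechanics of rank decay can be sketched as repeated SVD truncation: each round zeroes the smallest (low-contributing) singular values of the auxiliary matrix until its rank reaches zero, at which point Φ can be dropped from inference entirely. This is a toy version under stated assumptions: the fixed schedule `(16, 8, 4, 0)` stands in for the paper's rank-based regularization γ, which selects components adaptively rather than by a preset count.

```python
import numpy as np

def truncate_rank(phi: np.ndarray, keep: int) -> np.ndarray:
    """One rank-decay round: SVD the auxiliary matrix, zero all but the
    `keep` largest singular values, and rebuild it."""
    u, s, vt = np.linalg.svd(phi, full_matrices=False)
    s[keep:] = 0.0                        # decay low-contributing components
    return (u * s) @ vt

rng = np.random.default_rng(2)
phi = rng.standard_normal((32, 32))       # toy auxiliary module
ranks = []
for keep in (16, 8, 4, 0):                # decay schedule toward rank zero
    phi = truncate_rank(phi, keep)
    ranks.append(int(np.linalg.matrix_rank(phi)))
print(ranks)  # [16, 8, 4, 0]
```

Because the rank reaches exactly zero, the auxiliary branch contributes nothing at inference time, matching the claim of zero additional inference cost.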