Abstract:

Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective: the sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose QuantSparse, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce Multi-Scale Salient Attention Distillation, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop Second-Order Sparse Attention Reparameterization, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a 3.68× reduction in storage and a 1.88× acceleration in end-to-end inference.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes QuantSparse, a unified framework integrating model quantization with attention sparsification for compressing video diffusion transformers. It sits within the 'Unified Quantization-Sparsity Frameworks' leaf of the taxonomy, which contains only two papers total (including the original). This suggests a relatively sparse research direction, as most prior work has focused on applying quantization or sparsification independently rather than co-designing them. The framework's core contribution addresses the challenge that naive integration of these techniques leads to amplified attention shifts due to sparsity-induced information loss exacerbating quantization noise.

The taxonomy reveals a broader landscape where quantization-based and sparsification-based compression constitute distinct, well-populated branches with specialized subtopics. Neighboring areas include 'Training-Aware Co-Design' (joint FP8 quantization and sparsity optimization with training) and 'Pattern-Aware Reordering for Sparse Quantization' (attention pattern reordering for combined techniques). The unified framework approach differs from these by targeting post-training scenarios without pattern reordering. The 'Caching and Dynamic Computation Optimization' branch offers orthogonal strategies exploiting temporal reuse, while 'Quantization-Based Compression' alone contains multiple specialized PTQ methods that do not address sparsity interactions.

Among the three contributions analyzed from 20 candidate papers, the unified framework itself shows one refutable candidate among 10 examined, indicating some prior exploration of combined quantization-sparsity approaches within the limited search scope. Multi-Scale Salient Attention Distillation examined 6 candidates with none clearly refuting the novelty, suggesting the specific distillation strategy may be less explored. Second-Order Sparse Attention Reparameterization examined 4 candidates without refutation, pointing to potential novelty in exploiting temporal stability of second-order residuals. The limited search scope (20 papers total) means these assessments reflect top semantic matches rather than exhaustive field coverage.

Based on this limited literature analysis, the work appears to occupy a relatively under-explored niche at the intersection of quantization and sparsification for video diffusion transformers. The single sibling paper in the same taxonomy leaf and the sparse population of the 'Joint Quantization and Sparsification' branch suggest the unified approach is not yet crowded. However, the presence of one potentially overlapping candidate for the core framework contribution indicates that the fundamental idea of combining these techniques has been attempted, even if the specific distillation and reparameterization strategies may offer differentiation.

Taxonomy

Core-task Taxonomy Papers: 42
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 1

Research Landscape Overview

Core task: compressing video diffusion transformers with quantization and sparsification. The field has organized itself around several complementary strategies for reducing the computational and memory demands of video diffusion transformers. Quantization-Based Compression focuses on reducing numerical precision through post-training or training-aware methods, with works like Q-VDiT[6] and Q-DiT[7] exploring low-bit representations. Sparsification-Based Compression targets redundancy in attention mechanisms and temporal structures, as seen in Sparse VideoGen[1] and Bidirectional Sparse Attention[2]. Joint Quantization and Sparsification combines both techniques within unified frameworks, while Caching and Dynamic Computation Optimization exploits temporal reuse patterns across diffusion steps, exemplified by QuantCache[22]. Architectural Compression and Pruning addresses the model structure itself, Mobile and Edge Deployment Optimization tailors solutions for resource-constrained devices, and Specialized Compression Techniques encompasses domain-specific innovations that do not fit neatly into the other categories.

Recent work has increasingly explored the synergy between quantization and sparsity, recognizing that each addresses a different bottleneck in video generation pipelines. QuantSparse[0] sits within the Unified Quantization-Sparsity Frameworks branch, emphasizing the joint application of both techniques to achieve greater compression than either alone. This approach contrasts with purely quantization-focused methods like Efficient-vDiT[5], which prioritizes precision reduction, and with purely sparse methods that target attention patterns. A closely related work, DiTFastAttn[3], also explores combined strategies but may differ in how it balances the two dimensions or handles temporal dependencies.

The central tension across these branches involves trade-offs between compression ratio, generation quality, and inference speed, with unified frameworks attempting to navigate this space more holistically than single-technique approaches.

Claimed Contributions

QuantSparse unified compression framework

The authors introduce QuantSparse, a framework that synergistically combines model quantization and attention sparsification to compress video diffusion transformers. This addresses the severe performance degradation that occurs when naively integrating these two orthogonal compression techniques.

Retrieved papers: 10 (status: Can Refute)
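To make the two techniques being combined concrete, below is a minimal NumPy sketch of uniform quantization and row-wise top-k attention sparsification. The function names, bit-width, and keep ratio are illustrative assumptions, not the paper's actual scheme; the point is that naive integration feeds the error of one stage into the other.

```python
import numpy as np

def quantize(x, bits=8):
    """Uniform symmetric quantization (illustrative, not the paper's scheme)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def sparse_attention(q, k, v, keep=0.5):
    """Row-wise top-k attention: each query keeps its `keep` fraction of
    highest-scoring keys; the rest are masked out before the softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    thresh = np.quantile(scores, 1.0 - keep, axis=-1, keepdims=True)
    masked = np.where(scores >= thresh, scores, -np.inf)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Naive integration simply composes the two stages:
#   out = sparse_attention(quantize(q), quantize(k), quantize(v), keep=0.5)
# The report's claim is that the two error sources compound at this point.
```

With `keep=1.0` the function reduces to dense softmax attention, which makes the sparsity-induced deviation easy to measure against a full-attention reference.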
Multi-Scale Salient Attention Distillation (MSAD)

A memory-efficient distillation scheme that balances global structural guidance (via downsampled attention) and local salient supervision (focusing on high-impact tokens) to align quantized attention with full-precision attention and mitigate quantization-induced bias.

Retrieved papers: 6
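The described balance of global and local terms can be sketched as a simple two-term distillation loss. The NumPy sketch below assumes average pooling for the downsampled attention maps and peak attention weight as the saliency criterion; the pooling factor, top-k count, and weighting are hypothetical, not taken from the paper.

```python
import numpy as np

def msad_loss(attn_q, attn_fp, pool=4, k=8, alpha=0.5):
    """Two-term distillation loss sketch: global structural guidance on
    downsampled attention maps plus local supervision on salient rows.
    Hyperparameters (`pool`, `k`, `alpha`) are illustrative assumptions."""
    T = attn_q.shape[0]
    g = T // pool
    # Global term: average-pool both (T, T) maps down to (g, g) and compare.
    def down(a):
        return a[:g * pool, :g * pool].reshape(g, pool, g, pool).mean(axis=(1, 3))
    global_term = np.mean((down(attn_q) - down(attn_fp)) ** 2)
    # Local term: compare only the k rows with the highest peak attention
    # weight in the full-precision map (a simple saliency proxy).
    salient = np.argsort(attn_fp.max(axis=1))[-k:]
    local_term = np.mean((attn_q[salient] - attn_fp[salient]) ** 2)
    return alpha * global_term + (1.0 - alpha) * local_term
```

The pooled term keeps memory low (it never materializes a full-resolution difference map beyond one layer at a time), which matches the "memory-efficient" framing above.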
Second-Order Sparse Attention Reparameterization (SSAR)

A technique that exploits the temporal stability of second-order residuals (rather than first-order) to recover information lost due to sparsity. It uses SVD projection onto dominant principal components to provide lightweight yet accurate correction of sparse attention outputs.

Retrieved papers: 4
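The description above suggests a correction term built from second-order temporal differences of the sparse-attention residual, compressed by SVD. The following NumPy sketch works under that reading; the function name, rank hyperparameter, and the averaging step are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def ssar_correction(residuals, rank=2):
    """Sketch of a second-order residual correction (illustrative).

    residuals: sequence of (N, D) residuals between full and sparse
    attention outputs at successive calibration timesteps. The second-order
    (difference-of-differences) residual is assumed temporally stable, so
    it is averaged over time and compressed via SVD onto its top `rank`
    principal components, yielding a lightweight correction term.
    """
    r = np.stack(residuals)                  # (T, N, D)
    second_order = np.diff(r, n=2, axis=0)   # (T-2, N, D) second differences
    stable = second_order.mean(axis=0)       # exploit temporal stability
    U, S, Vt = np.linalg.svd(stable, full_matrices=False)
    # Low-rank reparameterization: keep only dominant principal components.
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

# A corrected output would then be: sparse_out + ssar_correction(residuals)
```

Because only the top-`rank` factors are stored, the correction costs O(rank · (N + D)) memory instead of O(N · D), which is what makes it lightweight.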

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

QuantSparse unified compression framework

Contribution

Multi-Scale Salient Attention Distillation (MSAD)

Contribution

Second-Order Sparse Attention Reparameterization (SSAR)
