BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Diffusion Model, Video Generation, Cache
Abstract:

Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24× speedup with comparable visual quality.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes BWCache, a training-free method that caches and reuses features from entire DiT blocks across diffusion timesteps to accelerate video generation. It resides in the Block and Layer-Level Caching leaf, which contains four papers including the original work. This leaf sits within the broader Feature Caching Granularity and Reuse Strategies branch, indicating a moderately populated research direction focused on structural unit caching. The taxonomy shows this is one of four granularity approaches, suggesting the field has diversified into multiple caching strategies rather than converging on a single dominant paradigm.

The taxonomy reveals neighboring leaves exploring alternative granularities: Token-Level Caching (three papers) focuses on selective token reuse, while Hybrid and Multi-Granularity Caching (three papers) combines multiple levels. The Temporal Scheduling and Adaptive Caching branch (eleven papers across four leaves) addresses complementary questions of when to cache, with Similarity-Driven Adaptive Caching being particularly relevant. BWCache's block-level approach contrasts with token-wise methods that offer finer control but higher overhead, and differs from hybrid frameworks that blend multiple granularities. The taxonomy's scope and exclude notes clarify that BWCache's structural unit focus distinguishes it from temporal scheduling or memory optimization directions.

Among the thirty candidates examined, the analysis found nine refutable pairs across the three claimed contributions. For the core BWCache method, ten candidates were examined and two appear to refute it; for the similarity indicator, ten were examined with three refutable matches; and for the U-shaped variation analysis, ten were examined with four refutable candidates. These statistics suggest that, within the limited search scope, each contribution faces some degree of prior overlap, with the feature-dynamics analysis encountering the most substantial prior work. The block-wise caching concept and similarity-based triggering both show moderate overlap among the examined candidates, though the search scale limits definitive conclusions about field-wide novelty.

Based on the top-thirty semantic matches examined, the work appears to build on established block-level caching concepts with incremental refinements in similarity-based triggering and feature dynamics analysis. The taxonomy structure indicates this is an active but not overcrowded research area, with the original paper positioned among three siblings in its leaf. The analysis does not cover exhaustive citation networks or recent preprints, so additional related work may exist beyond the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 33
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 9

Research Landscape Overview

Core task: accelerating video diffusion transformers through block-wise caching.

The field of accelerating video diffusion transformers has rapidly diversified into several complementary directions. Feature Caching Granularity and Reuse Strategies explores how to cache and reuse intermediate computations at different levels, ranging from token-wise approaches like Token-Wise Feature Caching[1] and Token Caching[3] to block- and layer-level methods such as BWCache[0] and Blockdance[11]. Temporal Scheduling and Adaptive Caching focuses on dynamically deciding when and what to cache across diffusion timesteps, with works like Adaptive Caching[6] and Runtime-Adaptive Caching[7] learning or heuristically adjusting cache policies. Memory and Storage Optimization tackles the overhead of storing cached features through quantization (Quantcache[15]) and compression techniques (MagCache[16], Ca2-VDM[17]). Specialized Architectures and Conditioning investigates architectural modifications and conditioning mechanisms that inherently reduce computation, while Distributed and Parallel Inference addresses multi-device scenarios. Finally, Training-Free Acceleration Frameworks encompasses holistic systems that combine multiple strategies without requiring model retraining, exemplified by approaches like FORA[23] and Unicp[24].

Within Feature Caching Granularity and Reuse Strategies, a central tension emerges between fine-grained token-level caching, which offers flexibility but may incur higher bookkeeping costs, and coarser block- or layer-level caching that simplifies implementation at the potential expense of adaptability. BWCache[0] sits squarely in the Block and Layer-Level Caching cluster alongside Learning-to-Cache[8] and CorGi[29], emphasizing structured reuse of entire transformer blocks across timesteps. This contrasts with token-centric methods like Token Caching[3] and Dual Feature Caching[2], which selectively cache individual tokens based on redundancy metrics. Meanwhile, hybrid strategies such as Blockdance[11] blend block-level decisions with finer control, illustrating ongoing exploration of the granularity sweet spot. The interplay between caching granularity, memory footprint, and quality preservation remains an active research question, with BWCache[0] contributing a block-wise perspective that balances efficiency gains against the need for temporal coherence in video generation.

Claimed Contributions

Block-Wise Caching (BWCache) method for accelerating DiT-based video generation

The authors introduce BWCache, a training-free acceleration method that dynamically caches and reuses features from DiT blocks across diffusion timesteps. This method can be seamlessly integrated into most DiT-based models as a plug-and-play component during inference.

Retrieved papers: 10 · Verdict: Can Refute
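The caching mechanism this contribution describes can be sketched as a thin wrapper around each block: store the block's output at one timestep and return it unchanged at a later timestep when reuse is triggered. The sketch below is illustrative only; the class name, the `block_fn` interface, and the `reuse` flag are assumptions, not the paper's implementation.

```python
import numpy as np

class CachedBlock:
    """Wraps a single transformer-block function with a one-step feature cache."""

    def __init__(self, block_fn):
        self.block_fn = block_fn  # the underlying DiT block (here: any array -> array function)
        self.cached = None        # output stored from the previous timestep

    def __call__(self, x, reuse=False):
        if reuse and self.cached is not None:
            return self.cached            # skip the block; reuse last timestep's features
        self.cached = self.block_fn(x)    # otherwise recompute and refresh the cache
        return self.cached
```

At inference time, a per-timestep decision (driven by feature similarity, per the paper) tells each wrapped block whether to recompute or reuse, which is what makes the method plug-and-play.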
Similarity indicator for triggering feature reuse

The authors propose a similarity indicator based on the relative L1 distance between block features at adjacent timesteps. This indicator determines when to reuse cached features versus recomputing them, balancing computational efficiency with visual quality.

Retrieved papers: 10 · Verdict: Can Refute
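The indicator admits a compact sketch: compute the relative L1 distance between a block's features at adjacent timesteps and reuse only when it falls below a threshold. The exact normalization, the 0.05 default, and the function name below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def should_reuse(feat_prev, feat_curr, threshold=0.05):
    """Trigger feature reuse when the relative L1 distance between
    block features at adjacent timesteps falls below `threshold`."""
    rel_l1 = np.abs(feat_curr - feat_prev).mean() / (np.abs(feat_prev).mean() + 1e-8)
    return bool(rel_l1 < threshold)
```

A larger threshold trades visual fidelity for speed, since more timesteps qualify for reuse.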
Analysis of DiT block feature dynamics revealing U-shaped variation pattern

The authors analyze DiT block feature variations across diffusion timesteps, discovering a U-shaped pattern where intermediate timesteps exhibit high similarity and substantial computational redundancy. This analysis motivates the block-wise caching approach.

Retrieved papers: 10 · Verdict: Can Refute
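The analysis behind this contribution can be reproduced in miniature: collect one feature tensor per timestep and measure the relative change between consecutive steps. The helper below is a hypothetical sketch under that assumption; the paper's exact metric may differ.

```python
import numpy as np

def variation_curve(features):
    """Relative L1 change between consecutive per-timestep block outputs.

    A U-shaped curve (large at the first and last steps, small in between)
    indicates that intermediate timesteps are redundant and safe to cache.
    """
    return [
        np.abs(curr - prev).mean() / (np.abs(prev).mean() + 1e-8)
        for prev, curr in zip(features[:-1], features[1:])
    ]
```

The low middle of such a curve is exactly the window where block-wise reuse saves computation with little loss of fidelity.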

