Abstract:

Diffusion Large Language Models (dLLMs) offer a promising alternative to autoregressive models, excelling in text generation tasks thanks to their bidirectional attention mechanisms. However, their computational complexity, which scales as O(L³) with sequence length L, poses significant challenges for long-sequence and real-time applications, primarily because the non-autoregressive denoising steps are incompatible with key-value caching. Existing acceleration methods rely on static caching or parallel decoding strategies, which fail to account for the dynamic behavior of token properties across layers and decoding steps. We propose Dynamic-dLLM, a training-free framework that improves dLLM inference efficiency through two components: Dynamic Cache Updating (DCU), which adaptively allocates cache-update budgets based on layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which dynamically calibrates decoding thresholds to balance generation quality and efficiency. Extensive experiments on models including LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct across benchmarks such as MMLU, GSM8K, and HumanEval demonstrate that Dynamic-dLLM substantially improves inference speed, attaining an average speedup exceeding 3× while maintaining performance. Dynamic-dLLM outperforms state-of-the-art acceleration methods and provides a plug-and-play solution for efficient dLLM deployment. Code and models will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Dynamic-dLLM, a training-free framework combining Dynamic Cache Updating (DCU) and Adaptive Parallel Decoding (APD) to accelerate diffusion language model inference. It resides in the 'Adaptive Cache Management and Eviction' leaf, which contains five papers total including this one. This leaf sits within the broader 'Caching Mechanisms for Diffusion LLMs' branch, indicating a moderately active research direction focused on reducing redundant computation through selective cache retention. The taxonomy reveals this is a well-populated area with multiple competing approaches to cache management.

The taxonomy shows neighboring leaves include 'KV-Cache Adaptation for Bidirectional Attention' (three papers on enabling traditional caching in bidirectional models) and several parallel decoding categories ('Confidence-Based Parallel Decoding', 'Adaptive and Learnable Parallel Decoding'). Dynamic-dLLM bridges these areas by combining adaptive caching with parallel decoding, positioning it at the intersection of two major acceleration paradigms. The taxonomy's scope notes clarify that this leaf excludes static KV-cache implementations and parallel-only methods, emphasizing the focus on dynamic, layer-aware cache management strategies that respond to token-level dynamics during generation.

Among the twenty-three candidates examined, the DCU mechanism shows overlap with three prior works out of ten candidates reviewed, while APD overlaps with one of three candidates examined. The combined framework shows overlap with two of ten candidates. These statistics suggest that while individual components have some precedent in the limited search scope, the specific combination and dynamic calibration approach may offer incremental novelty. The relatively small candidate pool (twenty-three total) means these findings reflect top semantic matches rather than exhaustive coverage of the field's prior work.

Based on the limited search scope, the work appears to synthesize existing acceleration paradigms—adaptive caching and parallel decoding—into a unified framework. The taxonomy context reveals this sits in a moderately crowded research direction with multiple sibling papers exploring similar cache eviction strategies. The contribution-level statistics indicate partial overlap with prior work, though the specific dynamic calibration mechanisms may differentiate it from static or fixed-threshold approaches documented in the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 6

Research Landscape Overview

Core task: Accelerating inference of diffusion large language models. The field has organized itself around several complementary strategies for speeding up diffusion-based text generation.

Caching mechanisms (e.g., dLLM-Cache[10], dCache[35], d2Cache[33]) focus on reusing intermediate computations across diffusion steps to reduce redundant calculations. Parallel and adaptive decoding strategies (e.g., DiffuSpec[14], CreditDecoding[4], Learnable Parallel[29]) explore ways to predict or generate multiple tokens or steps simultaneously, trading off accuracy for throughput. Model architecture and training innovations (e.g., Sequential Diffusion[5], Block Diffusion[9], Seed Diffusion[15]) redesign the diffusion process itself to require fewer steps or simpler operations. Model compression and quantization (e.g., DLLMQuant[3], SparseD[17], Activation Sparsity[18]) reduce memory and compute footprints by pruning or quantizing model weights and activations. Meanwhile, diffusion LLM frameworks and benchmarks (e.g., dinfer[6], Dream[1], Diffusion Survey[2]) provide infrastructure and evaluation protocols, while related acceleration techniques borrow ideas from autoregressive LLM speedup methods such as speculative sampling.

Within this landscape, adaptive cache management and eviction strategies represent a particularly active line of work, balancing memory constraints against the need to preserve useful intermediate states across diffusion steps. Dynamic-dLLM[0] sits squarely in this branch, emphasizing runtime decisions about which cached entries to retain or discard based on evolving generation context. This contrasts with static caching schemes like dLLM-Cache[10] or d2Cache[33], which apply more uniform retention policies. Nearby works such as Sparse-dLLM[45] and dKV-Cache[8] also explore selective retention but differ in their eviction criteria—some prioritize attention scores, others leverage sparsity patterns in activations. The central trade-off across these methods is between cache hit rates and memory overhead, with open questions remaining about how to best adapt eviction policies to diverse prompt distributions and diffusion schedules.

Claimed Contributions

Dynamic Cache Updating (DCU) mechanism

DCU is a mechanism that dynamically distributes cache-update budgets across layers according to the varying dynamics of tokens at different layers. It prioritizes layers requiring frequent updates while reducing computational overhead in stable layers, addressing the limitation of static caching strategies.

10 retrieved papers
Can Refute
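The paper's exact DCU rule is not reproduced in this report. As a purely illustrative sketch of the idea, the following hypothetical helper allocates a fixed per-step cache-update budget across layers in proportion to a measured "drift" of each layer's token features since the last refresh; the drift metric, budget size, and rounding scheme are all assumptions, not the authors' method.

```python
def allocate_update_budget(layer_drifts, total_budget):
    """Distribute `total_budget` cache updates across layers in
    proportion to each layer's token-feature drift (hypothetical
    proxy for 'token dynamics'). More dynamic layers are refreshed
    more often; stable layers keep their cached states longer."""
    total_drift = sum(layer_drifts)
    # proportional floor allocation per layer
    budget = [int(d / total_drift * total_budget) for d in layer_drifts]
    # hand any leftover updates to the most dynamic layers first
    remainder = total_budget - sum(budget)
    by_drift = sorted(range(len(layer_drifts)), key=lambda i: -layer_drifts[i])
    for i in by_drift[:remainder]:
        budget[i] += 1
    return budget

# e.g. 4 layers with uneven drift, 8 cache updates available this step
print(allocate_update_budget([0.1, 0.4, 0.3, 0.2], 8))
```

Any scheme in this family shares the property highlighted in the contribution: stable layers receive little or no update budget, so their cached activations are reused rather than recomputed.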
Adaptive Parallel Decoding (APD) mechanism

APD replaces fixed confidence thresholds with a dynamic per-token unmasking strategy that adjusts decoding thresholds based on the predicted distribution of each token. This enables early commitment to confident predictions while postponing uncertain ones, achieving a better trade-off between speed and output quality.

3 retrieved papers
Can Refute
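The precise threshold-calibration rule of APD is not detailed in this report. The sketch below shows one plausible instance of the general idea, distribution-dependent rather than fixed thresholds: each masked position's commit threshold is scaled by the normalized entropy of its predicted distribution, so peaked (confident) distributions commit early while flat ones wait. The entropy heuristic and `base_threshold` value are assumptions for illustration only.

```python
import math

def select_tokens_to_unmask(token_probs, base_threshold=0.9):
    """For each masked position, decide whether to commit (unmask) now.
    Instead of a fixed confidence cutoff, the threshold is relaxed by
    the normalized entropy of the token's predicted distribution:
    peaked distributions get a lower bar, flat ones a higher bar.
    Purely illustrative heuristic, not the paper's exact rule."""
    commit = []
    for pos, probs in enumerate(token_probs):
        top_p = max(probs)
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        norm_entropy = entropy / math.log(len(probs))  # in [0, 1]
        threshold = base_threshold * norm_entropy
        if top_p >= threshold:
            commit.append(pos)
    return commit
```

For example, a sharply peaked prediction like [0.97, 0.01, 0.01, 0.01] commits immediately, while a near-uniform one like [0.3, 0.25, 0.25, 0.2] is deferred to a later denoising step.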
Dynamic-dLLM training-free acceleration framework

Dynamic-dLLM is a plug-and-play training-free framework that combines DCU and APD to accelerate diffusion LLM inference. It addresses the cubic computational complexity of dLLMs by accounting for dynamic token behavior across layers and decoding steps, achieving significant speedups while maintaining performance.

10 retrieved papers
Can Refute
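The cubic cost cited here follows from a simple back-of-envelope model: roughly L denoising steps, each requiring a full bidirectional pass whose attention costs O(L²), giving O(L³) overall. The toy cost model below makes that arithmetic concrete and shows how the two components compose: caching skips a fraction of per-step work (DCU's effect) and parallel decoding cuts the step count (APD's effect). The specific parameters are illustrative assumptions, not measured quantities from the paper.

```python
def dllm_decode_cost(L, cached_fraction=0.0, parallel_factor=1):
    """Back-of-envelope operation count for dLLM decoding:
    ~L denoising steps, each a full O(L^2) bidirectional attention
    pass -> O(L^3) total. `cached_fraction` models work skipped via
    caching; `parallel_factor` models tokens committed per step."""
    steps = max(1, L // parallel_factor)       # fewer steps with parallel decoding
    per_step = (1.0 - cached_fraction) * L * L  # cheaper steps with caching
    return steps * per_step

# baseline vs. a hypothetical accelerated configuration
L = 1024
baseline = dllm_decode_cost(L)                                   # ~L^3
accelerated = dllm_decode_cost(L, cached_fraction=0.5, parallel_factor=4)
print(baseline / accelerated)  # combined speedup of the two levers
```

Under these assumed settings the two levers multiply (2× from halved per-step work, 4× from quartered step count), which is the intuition behind combining a caching component with a parallel-decoding component in one framework.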

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Dynamic Cache Updating (DCU) mechanism

DCU is a mechanism that dynamically distributes cache-update budgets across layers according to the varying dynamics of tokens at different layers. It prioritizes layers requiring frequent updates while reducing computational overhead in stable layers, addressing the limitation of static caching strategies.

Contribution

Adaptive Parallel Decoding (APD) mechanism

APD replaces fixed confidence thresholds with a dynamic per-token unmasking strategy that adjusts decoding thresholds based on the predicted distribution of each token. This enables early commitment to confident predictions while postponing uncertain ones, achieving a better trade-off between speed and output quality.

Contribution

Dynamic-dLLM training-free acceleration framework

Dynamic-dLLM is a plug-and-play training-free framework that combines DCU and APD to accelerate diffusion LLM inference. It addresses the cubic computational complexity of dLLMs by accounting for dynamic token behavior across layers and decoding steps, achieving significant speedups while maintaining performance.