Abstract:

Diffusion Large Language Models (dLLMs) offer a promising alternative to autoregressive models, excelling in text generation tasks thanks to their bidirectional attention mechanisms. However, their computational complexity, which scales as O(L³) with sequence length L, poses significant challenges for long-sequence and real-time applications, primarily because the non-autoregressive denoising steps are incompatible with key-value caching. Existing acceleration methods rely on static caching or parallel decoding strategies, which fail to account for the dynamic behavior of token properties across layers and decoding steps. We propose Dynamic-dLLM, a training-free framework that improves dLLM inference efficiency through two components: Dynamic Cache Updating (DCU), which adaptively allocates cache-update budgets based on layer-wise token dynamics, and Adaptive Parallel Decoding (APD), which dynamically calibrates decoding thresholds to balance generation quality and efficiency. Extensive experiments on models including LLaDA-8B-Instruct, LLaDA-1.5, and Dream-v0-7B-Instruct across benchmarks such as MMLU, GSM8K, and HumanEval demonstrate that Dynamic-dLLM substantially improves inference speed, attaining an average speedup exceeding 3× while maintaining performance. Dynamic-dLLM outperforms state-of-the-art acceleration methods and provides a plug-and-play solution for efficient dLLM deployment. Code and models will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Dynamic-dLLM, a training-free framework combining Dynamic Cache Updating (DCU) and Adaptive Parallel Decoding (APD) to accelerate diffusion language model inference. It resides in the 'Adaptive Cache Management and Eviction' leaf, which contains five papers total including this one. This leaf sits within the broader 'Caching Mechanisms for Diffusion LLMs' branch, indicating a moderately active research direction focused on reducing redundant computation through selective cache retention. The taxonomy reveals this is a well-populated area with multiple competing approaches to cache management.

The taxonomy shows neighboring leaves include 'KV-Cache Adaptation for Bidirectional Attention' (three papers on enabling traditional caching in bidirectional models) and several parallel decoding categories ('Confidence-Based Parallel Decoding', 'Adaptive and Learnable Parallel Decoding'). Dynamic-dLLM bridges these areas by combining adaptive caching with parallel decoding, positioning it at the intersection of two major acceleration paradigms. The taxonomy's scope notes clarify that this leaf excludes static KV-cache implementations and parallel-only methods, emphasizing the focus on dynamic, layer-aware cache management strategies that respond to token-level dynamics during generation.

Among the twenty-three candidates examined, the DCU mechanism shows overlap with three prior works out of ten candidates reviewed, while APD overlaps with one of three candidates examined. The combined framework shows overlap with two of ten candidates. These statistics suggest that while individual components have some precedent in the limited search scope, the specific combination and dynamic calibration approach may offer incremental novelty. The relatively small candidate pool (twenty-three total) means these findings reflect top semantic matches rather than exhaustive coverage of the field's prior work.

Based on the limited search scope, the work appears to synthesize existing acceleration paradigms—adaptive caching and parallel decoding—into a unified framework. The taxonomy context reveals this sits in a moderately crowded research direction with multiple sibling papers exploring similar cache eviction strategies. The contribution-level statistics indicate partial overlap with prior work, though the specific dynamic calibration mechanisms may differentiate it from static or fixed-threshold approaches documented in the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 6

Research Landscape Overview

Core task: Accelerating inference of diffusion large language models. The field has organized itself around several complementary strategies for speeding up diffusion-based text generation.

Caching mechanisms (e.g., dLLM-Cache[10], dCache[35], d2Cache[33]) focus on reusing intermediate computations across diffusion steps to reduce redundant calculations. Parallel and adaptive decoding strategies (e.g., DiffuSpec[14], CreditDecoding[4], Learnable Parallel[29]) explore ways to predict or generate multiple tokens or steps simultaneously, trading off accuracy for throughput. Model architecture and training innovations (e.g., Sequential Diffusion[5], Block Diffusion[9], Seed Diffusion[15]) redesign the diffusion process itself to require fewer steps or simpler operations. Model compression and quantization (e.g., DLLMQuant[3], SparseD[17], Activation Sparsity[18]) reduce memory and compute footprints by pruning or quantizing model weights and activations. Meanwhile, diffusion LLM frameworks and benchmarks (e.g., dinfer[6], Dream[1], Diffusion Survey[2]) provide infrastructure and evaluation protocols, while related acceleration techniques borrow ideas from autoregressive LLM speedup methods such as speculative sampling.

Within this landscape, adaptive cache management and eviction strategies represent a particularly active line of work, balancing memory constraints against the need to preserve useful intermediate states across diffusion steps. Dynamic-dLLM[0] sits squarely in this branch, emphasizing runtime decisions about which cached entries to retain or discard based on evolving generation context. This contrasts with static caching schemes like dLLM-Cache[10] or d2Cache[33], which apply more uniform retention policies. Nearby works such as Sparse-dLLM[45] and dKV-Cache[8] also explore selective retention but differ in their eviction criteria—some prioritize attention scores, others leverage sparsity patterns in activations. The central trade-off across these methods is between cache hit rates and memory overhead, with open questions remaining about how to best adapt eviction policies to diverse prompt distributions and diffusion schedules.

Claimed Contributions

Dynamic Cache Updating (DCU) mechanism

DCU is a mechanism that dynamically distributes cache-update budgets across layers according to the varying dynamics of tokens at different layers. It prioritizes layers requiring frequent updates while reducing computational overhead in stable layers, addressing the limitation of static caching strategies.

10 retrieved papers
Can Refute
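The paper's exact DCU rule is not reproduced in this report. As a purely illustrative sketch of the idea, the following hypothetical helper allocates a fixed per-step cache-update budget across layers in proportion to a measured "drift" of each layer's token features since the last refresh; the drift metric, budget size, and rounding scheme are all assumptions, not the authors' method.

```python
def allocate_update_budget(layer_drifts, total_budget):
    """Distribute `total_budget` cache updates across layers in
    proportion to each layer's token-feature drift (hypothetical
    proxy for 'token dynamics'). More dynamic layers are refreshed
    more often; stable layers keep their cached states longer."""
    total_drift = sum(layer_drifts)
    # proportional floor allocation per layer
    budget = [int(d / total_drift * total_budget) for d in layer_drifts]
    # hand any leftover updates to the most dynamic layers first
    remainder = total_budget - sum(budget)
    by_drift = sorted(range(len(layer_drifts)), key=lambda i: -layer_drifts[i])
    for i in by_drift[:remainder]:
        budget[i] += 1
    return budget

# e.g. 4 layers with uneven drift, 8 cache updates available this step
print(allocate_update_budget([0.1, 0.4, 0.3, 0.2], 8))
```

Any scheme in this family shares the property highlighted in the contribution: stable layers receive little or no update budget, so their cached activations are reused rather than recomputed.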
Adaptive Parallel Decoding (APD) mechanism

APD replaces fixed confidence thresholds with a dynamic per-token unmasking strategy that adjusts decoding thresholds based on the predicted distribution of each token. This enables early commitment to confident predictions while postponing uncertain ones, achieving a better trade-off between speed and output quality.

3 retrieved papers
Can Refute
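The precise threshold-calibration rule of APD is not detailed in this report. The sketch below shows one plausible instance of the general idea, distribution-dependent rather than fixed thresholds: each masked position's commit threshold is scaled by the normalized entropy of its predicted distribution, so peaked (confident) distributions commit early while flat ones wait. The entropy heuristic and `base_threshold` value are assumptions for illustration only.

```python
import math

def select_tokens_to_unmask(token_probs, base_threshold=0.9):
    """For each masked position, decide whether to commit (unmask) now.
    Instead of a fixed confidence cutoff, the threshold is relaxed by
    the normalized entropy of the token's predicted distribution:
    peaked distributions get a lower bar, flat ones a higher bar.
    Purely illustrative heuristic, not the paper's exact rule."""
    commit = []
    for pos, probs in enumerate(token_probs):
        top_p = max(probs)
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        norm_entropy = entropy / math.log(len(probs))  # in [0, 1]
        threshold = base_threshold * norm_entropy
        if top_p >= threshold:
            commit.append(pos)
    return commit
```

For example, a sharply peaked prediction like [0.97, 0.01, 0.01, 0.01] commits immediately, while a near-uniform one like [0.3, 0.25, 0.25, 0.2] is deferred to a later denoising step.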
Dynamic-dLLM training-free acceleration framework

Dynamic-dLLM is a plug-and-play training-free framework that combines DCU and APD to accelerate diffusion LLM inference. It addresses the cubic computational complexity of dLLMs by accounting for dynamic token behavior across layers and decoding steps, achieving significant speedups while maintaining performance.

10 retrieved papers
Can Refute
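The cubic cost cited here follows from a simple back-of-envelope model: roughly L denoising steps, each requiring a full bidirectional pass whose attention costs O(L²), giving O(L³) overall. The toy cost model below makes that arithmetic concrete and shows how the two components compose: caching skips a fraction of per-step work (DCU's effect) and parallel decoding cuts the step count (APD's effect). The specific parameters are illustrative assumptions, not measured quantities from the paper.

```python
def dllm_decode_cost(L, cached_fraction=0.0, parallel_factor=1):
    """Back-of-envelope operation count for dLLM decoding:
    ~L denoising steps, each a full O(L^2) bidirectional attention
    pass -> O(L^3) total. `cached_fraction` models work skipped via
    caching; `parallel_factor` models tokens committed per step."""
    steps = max(1, L // parallel_factor)       # fewer steps with parallel decoding
    per_step = (1.0 - cached_fraction) * L * L  # cheaper steps with caching
    return steps * per_step

# baseline vs. a hypothetical accelerated configuration
L = 1024
baseline = dllm_decode_cost(L)                                   # ~L^3
accelerated = dllm_decode_cost(L, cached_fraction=0.5, parallel_factor=4)
print(baseline / accelerated)  # combined speedup of the two levers
```

Under these assumed settings the two levers multiply (2× from halved per-step work, 4× from quartered step count), which is the intuition behind combining a caching component with a parallel-decoding component in one framework.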

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Dynamic Cache Updating (DCU) mechanism

DCU is a mechanism that dynamically distributes cache-update budgets across layers according to the varying dynamics of tokens at different layers. It prioritizes layers requiring frequent updates while reducing computational overhead in stable layers, addressing the limitation of static caching strategies.

Contribution

Adaptive Parallel Decoding (APD) mechanism

APD replaces fixed confidence thresholds with a dynamic per-token unmasking strategy that adjusts decoding thresholds based on the predicted distribution of each token. This enables early commitment to confident predictions while postponing uncertain ones, achieving a better trade-off between speed and output quality.

Contribution

Dynamic-dLLM training-free acceleration framework

Dynamic-dLLM is a plug-and-play training-free framework that combines DCU and APD to accelerate diffusion LLM inference. It addresses the cubic computational complexity of dLLMs by accounting for dynamic token behavior across layers and decoding steps, achieving significant speedups while maintaining performance.