Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Diffusion Large Language Models · Discrete Diffusion Models · Inference Acceleration · KV Cache · AR-Diffusion Hybrid
Abstract:

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks, for inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs to achieve rapid convergence. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than 50× while maintaining comparable output quality.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Discrete Diffusion Forcing (D2F), a framework that enables diffusion language models to perform block-wise autoregressive generation with inter-block parallel decoding. It sits within the 'Adaptive and Learnable Parallel Decoding' leaf of the taxonomy, which contains five papers total including this one. This leaf represents a moderately active research direction focused on dynamically adjusting parallelism during diffusion inference. The taxonomy shows this is one of four parallel decoding strategy categories, indicating a reasonably crowded subfield with multiple competing approaches to accelerating diffusion-based text generation.

The taxonomy reveals several neighboring research directions that contextualize this work. Sibling categories include 'Speculative and Exploration-Based Decoding' (two papers) and 'Block-Wise and Truncated Block Generation' (two papers), both exploring alternative parallelization patterns. The 'Hybrid Autoregressive-Diffusion Architectures' branch (two papers) addresses similar AR-diffusion integration but through architectural modifications rather than inference strategies. The scope notes clarify that D2F's combination of block-wise AR generation with diffusion-based parallel decoding distinguishes it from pure fixed-block methods and from approaches that modify model architecture during training rather than inference patterns.

Among the three contributions analyzed, the asymmetric distillation strategy shows the most substantial prior-work overlap: three of the four candidates examined appear to provide refuting evidence, suggesting this training technique has precedent, at least within the limited search scope of 24 total candidates. The D2F framework itself and the pipelined parallel decoding algorithm show no clear refutations among the 10 candidates examined for each. This pattern suggests the core inference mechanism may be more novel than the training approach, though the limited search scale (24 candidates in total, not hundreds) means these findings reflect top semantic matches rather than exhaustive coverage of all relevant literature.

Based on the taxonomy structure and contribution-level statistics from this limited search, the work appears to occupy a moderately explored research direction with some novel elements. The framework's hybrid AR-diffusion paradigm and pipelined decoding show less direct overlap in the examined candidates, while the distillation strategy connects to established training techniques. The analysis covers top-30 semantic matches and does not claim comprehensive coverage of all potentially relevant work in diffusion language model acceleration.

Taxonomy

- Core-task Taxonomy Papers: 27
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 24
- Refutable Papers: 3

Research Landscape Overview

Core task: Accelerating diffusion language model inference through block-wise parallel decoding. The field structure reflects a multifaceted effort to overcome the sequential bottleneck inherent in autoregressive generation by leveraging diffusion-based frameworks. The taxonomy organizes work into several main branches: Parallel Decoding Strategies and Algorithms, which explores methods for generating multiple tokens or blocks simultaneously; Model Architecture and Training Paradigms, focusing on how diffusion models are designed and trained for language tasks; Key-Value Cache Mechanisms and Memory Optimization, addressing efficiency bottlenecks in attention computation; Comparative Analysis and Survey Studies, providing overviews of discrete diffusion and diffusion language models more broadly (e.g., Discrete Diffusion Survey[9], Diffusion LLM Survey[11]); and Related Diffusion Model Techniques, capturing auxiliary innovations.

Within Parallel Decoding Strategies, a particularly active subarea is Adaptive and Learnable Parallel Decoding, where methods dynamically adjust decoding granularity or learn to predict multiple tokens in parallel, contrasting with fixed-block or static sampling approaches. Recent work in this subarea has explored diverse trade-offs between flexibility, training overhead, and inference speedup. Adaptive Parallel Decoding[1] and Learning to Parallel[3] exemplify efforts to make decoding strategies context-sensitive, while Learnable Parallel Decoding[4] and dParallel[12] investigate trainable modules that predict block-level content. Diffusion Forcing[0] sits naturally within this cluster, emphasizing block-wise parallel generation through a diffusion framework that balances parallelism with coherence across token blocks.
Compared to neighboring methods like Adaptive Parallel Decoding[1], which may rely on heuristic adaptation, or Learning to Parallel[3], which focuses on learning when to parallelize, Diffusion Forcing[0] integrates the diffusion process itself as the mechanism for parallel block refinement. This positions it as a method that leverages the iterative denoising nature of diffusion models to achieve both speed and quality, addressing open questions about how to best combine learned parallelism with the inherent structure of diffusion-based language generation.

Claimed Contributions

Discrete Diffusion Forcing (D2F) framework

D2F is a novel training paradigm that extends diffusion forcing to discrete sequences, enabling dLLMs to perform block-wise autoregressive generation with KV cache compatibility while supporting inter-block parallel decoding. This creates an AR-diffusion hybrid paradigm for efficient inference.
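The block-wise attention pattern this contribution describes can be sketched as a toy mask builder. This is a minimal illustration, not the paper's implementation; `block_causal_mask` and all parameter names are hypothetical:

```python
def block_causal_mask(seq_len: int, block_size: int):
    """Build a block-wise causal attention mask.

    Tokens attend bidirectionally within their own block (diffusion-style
    denoising), but only causally to earlier blocks, so the KV cache of a
    completed block can be reused as in an AR model.
    Returns mask[i][j] == True when token i may attend to token j.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = block_causal_mask(seq_len=6, block_size=2)
assert mask[0][1]      # bidirectional within a block
assert not mask[0][2]  # no attention to future blocks
assert all(mask[5])    # the last block sees every earlier block
```

The within-block bidirectionality is what preserves the diffusion-style parallel refinement, while the across-block causality is what makes cached keys and values of finished blocks valid for all later blocks.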

10 retrieved papers
Asymmetric distillation strategy

The method distills D2F dLLMs from existing pre-trained bidirectional dLLMs using an asymmetric loss where the teacher model uses global bidirectional attention while the student learns with block-wise causal attention, enabling efficient adaptation without costly retraining.
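As a rough illustration of the asymmetric loss, the sketch below computes a mean per-position KL divergence from teacher predictions (which would come from a forward pass with global bidirectional attention) to student predictions (from a pass with block-wise causal attention). The logits here are toy placeholders, and `asymmetric_distill_loss` is a hypothetical name, not the paper's code:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def asymmetric_distill_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student).

    teacher_logits: per-position logits from the bidirectional teacher.
    student_logits: per-position logits from the block-causal student.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        q = softmax(s_row)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(teacher_logits)

# Identical predictions give zero loss; any mismatch is penalized.
same = [[1.0, 2.0, 0.5]]
assert abs(asymmetric_distill_loss(same, same)) < 1e-9
assert asymmetric_distill_loss(same, [[0.5, 0.1, 2.0]]) > 0.0
```

The asymmetry lies entirely in how the two sets of logits are produced, not in the loss itself: the student is trained to match predictions the teacher made with strictly more context.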

4 retrieved papers · Can Refute
Pipelined parallel decoding algorithm

The algorithm dynamically manages a sliding window of active blocks with dual-state decoding (semi-activated and fully-activated states), enabling efficient inter-block parallelism while maintaining accurate KV cache reuse and offering a trade-off between efficiency and performance.
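The sliding-window behavior and its efficiency/efficacy trade-off can be illustrated with a toy scheduler. All names and step counts here are hypothetical; the actual algorithm activates tokens by confidence rather than fixed step budgets. In this sketch the earliest unfinished block plays the fully-activated role, while later blocks in the window are semi-activated, decoding against a still-incomplete prefix:

```python
def pipeline_decode(num_blocks, window=2, steps_per_block=3):
    """Toy trace of pipelined inter-block parallel decoding.

    Up to `window` blocks are active at once. The earliest unfinished
    block is treated as fully-activated; the rest of the window is
    semi-activated. Returns the step at which each block finishes.
    """
    remaining = [steps_per_block] * num_blocks
    finish = [0] * num_blocks
    step = 0
    while any(r > 0 for r in remaining):
        step += 1
        # The sliding window: the first `window` unfinished blocks.
        active = [b for b in range(num_blocks) if remaining[b] > 0][:window]
        for b in active:  # active[0] fully-activated, the rest semi-activated
            remaining[b] -= 1
            if remaining[b] == 0:
                finish[b] = step
    return finish

# A wider window overlaps more blocks and finishes sooner (efficiency),
# at the cost of conditioning later blocks on less-settled prefixes (efficacy).
assert pipeline_decode(4, window=1) == [3, 6, 9, 12]  # sequential baseline
assert pipeline_decode(4, window=2) == [3, 3, 6, 6]   # pipelined
```

The window size is the knob behind the claimed efficiency/performance trade-off: window=1 degenerates to strict block-by-block AR decoding, while larger windows increase parallelism but let semi-activated blocks commit to predictions before their prefix has fully settled.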

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Discrete Diffusion Forcing (D2F) framework

Contribution

Asymmetric distillation strategy

Contribution

Pipelined parallel decoding algorithm

Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing | Novelty Validation