Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Diffusion Large Language Models · Discrete Diffusion Models · Inference Acceleration · KV Cache · AR-Diffusion Hybrid
Abstract:

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks, for inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs to achieve rapid convergence. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than 50× while maintaining comparable output quality.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Discrete Diffusion Forcing (D2F), a framework that enables diffusion language models to perform block-wise autoregressive generation with inter-block parallel decoding. It sits within the 'Adaptive and Learnable Parallel Decoding' leaf of the taxonomy, which contains five papers total including this one. This leaf represents a moderately active research direction focused on dynamically adjusting parallelism during diffusion inference. The taxonomy shows this is one of four parallel decoding strategy categories, indicating a reasonably crowded subfield with multiple competing approaches to accelerating diffusion-based text generation.

The taxonomy reveals several neighboring research directions that contextualize this work. Sibling categories include 'Speculative and Exploration-Based Decoding' (two papers) and 'Block-Wise and Truncated Block Generation' (two papers), both exploring alternative parallelization patterns. The 'Hybrid Autoregressive-Diffusion Architectures' branch (two papers) addresses similar AR-diffusion integration but through architectural modifications rather than inference strategies. The scope notes clarify that D2F's combination of block-wise AR generation with diffusion-based parallel decoding distinguishes it from pure fixed-block methods and from approaches that modify model architecture during training rather than inference patterns.

Among the three contributions analyzed, the asymmetric distillation strategy shows the most substantial prior-work overlap: three of the four candidates examined appear to provide refuting evidence, suggesting this training technique has precedent, at least within the limited search scope of 24 total candidates. The D2F framework itself and the pipelined parallel decoding algorithm show no clear refutations among the 10 candidates examined for each. This pattern suggests the core inference mechanism may be more novel than the training approach, though the limited search scale (24 candidates in total, not hundreds) means these findings reflect top semantic matches rather than exhaustive coverage of all relevant literature.

Based on the taxonomy structure and contribution-level statistics from this limited search, the work appears to occupy a moderately explored research direction with some novel elements. The framework's hybrid AR-diffusion paradigm and pipelined decoding show less direct overlap in the examined candidates, while the distillation strategy connects to established training techniques. The analysis covers top-30 semantic matches and does not claim comprehensive coverage of all potentially relevant work in diffusion language model acceleration.

Taxonomy

- Core-task Taxonomy Papers: 27
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 24
- Refutable Papers: 3

Research Landscape Overview

Core task: Accelerating diffusion language model inference through block-wise parallel decoding. The field structure reflects a multifaceted effort to overcome the sequential bottleneck inherent in autoregressive generation by leveraging diffusion-based frameworks. The taxonomy organizes work into several main branches: Parallel Decoding Strategies and Algorithms, which explores methods for generating multiple tokens or blocks simultaneously; Model Architecture and Training Paradigms, focusing on how diffusion models are designed and trained for language tasks; Key-Value Cache Mechanisms and Memory Optimization, addressing efficiency bottlenecks in attention computation; Comparative Analysis and Survey Studies, providing overviews of discrete diffusion and diffusion language models more broadly (e.g., Discrete Diffusion Survey[9], Diffusion LLM Survey[11]); and Related Diffusion Model Techniques, capturing auxiliary innovations.

Within Parallel Decoding Strategies, a particularly active subarea is Adaptive and Learnable Parallel Decoding, where methods dynamically adjust decoding granularity or learn to predict multiple tokens in parallel, contrasting with fixed-block or static sampling approaches. Recent work in this subarea has explored diverse trade-offs between flexibility, training overhead, and inference speedup. Adaptive Parallel Decoding[1] and Learning to Parallel[3] exemplify efforts to make decoding strategies context-sensitive, while Learnable Parallel Decoding[4] and dParallel[12] investigate trainable modules that predict block-level content. Diffusion Forcing[0] sits naturally within this cluster, emphasizing block-wise parallel generation through a diffusion framework that balances parallelism with coherence across token blocks.
Compared to neighboring methods like Adaptive Parallel Decoding[1], which may rely on heuristic adaptation, or Learning to Parallel[3], which focuses on learning when to parallelize, Diffusion Forcing[0] integrates the diffusion process itself as the mechanism for parallel block refinement. This positions it as a method that leverages the iterative denoising nature of diffusion models to achieve both speed and quality, addressing open questions about how to best combine learned parallelism with the inherent structure of diffusion-based language generation.

Claimed Contributions

Discrete Diffusion Forcing (D2F) framework

D2F is a novel training paradigm that extends diffusion forcing to discrete sequences, enabling dLLMs to perform block-wise autoregressive generation with KV cache compatibility while supporting inter-block parallel decoding. This creates an AR-diffusion hybrid paradigm for efficient inference.
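The block-wise attention pattern this contribution describes can be sketched as a toy mask builder. This is a minimal illustration, not the paper's implementation; `block_causal_mask` and all parameter names are hypothetical:

```python
def block_causal_mask(seq_len: int, block_size: int):
    """Build a block-wise causal attention mask.

    Tokens attend bidirectionally within their own block (diffusion-style
    denoising), but only causally to earlier blocks, so the KV cache of a
    completed block can be reused as in an AR model.
    Returns mask[i][j] == True when token i may attend to token j.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = block_causal_mask(seq_len=6, block_size=2)
assert mask[0][1]      # bidirectional within a block
assert not mask[0][2]  # no attention to future blocks
assert all(mask[5])    # the last block sees every earlier block
```

The within-block bidirectionality is what preserves the diffusion-style parallel refinement, while the across-block causality is what makes cached keys and values of finished blocks valid for all later blocks.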

10 retrieved papers
Asymmetric distillation strategy

The method distills D2F dLLMs from existing pre-trained bidirectional dLLMs using an asymmetric loss where the teacher model uses global bidirectional attention while the student learns with block-wise causal attention, enabling efficient adaptation without costly retraining.
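As a rough illustration of the asymmetric loss, the sketch below computes a mean per-position KL divergence from teacher predictions (which would come from a forward pass with global bidirectional attention) to student predictions (from a pass with block-wise causal attention). The logits here are toy placeholders, and `asymmetric_distill_loss` is a hypothetical name, not the paper's code:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def asymmetric_distill_loss(teacher_logits, student_logits):
    """Mean per-position KL(teacher || student).

    teacher_logits: per-position logits from the bidirectional teacher.
    student_logits: per-position logits from the block-causal student.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        q = softmax(s_row)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(teacher_logits)

# Identical predictions give zero loss; any mismatch is penalized.
same = [[1.0, 2.0, 0.5]]
assert abs(asymmetric_distill_loss(same, same)) < 1e-9
assert asymmetric_distill_loss(same, [[0.5, 0.1, 2.0]]) > 0.0
```

The asymmetry lies entirely in how the two sets of logits are produced, not in the loss itself: the student is trained to match predictions the teacher made with strictly more context.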

4 retrieved papers · Can Refute
Pipelined parallel decoding algorithm

The algorithm dynamically manages a sliding window of active blocks with dual-state decoding (semi-activated and fully-activated states), enabling efficient inter-block parallelism while maintaining accurate KV cache reuse and offering a trade-off between efficiency and performance.
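The sliding-window behavior and its efficiency/efficacy trade-off can be illustrated with a toy scheduler. All names and step counts here are hypothetical; the actual algorithm activates tokens by confidence rather than fixed step budgets. In this sketch the earliest unfinished block plays the fully-activated role, while later blocks in the window are semi-activated, decoding against a still-incomplete prefix:

```python
def pipeline_decode(num_blocks, window=2, steps_per_block=3):
    """Toy trace of pipelined inter-block parallel decoding.

    Up to `window` blocks are active at once. The earliest unfinished
    block is treated as fully-activated; the rest of the window is
    semi-activated. Returns the step at which each block finishes.
    """
    remaining = [steps_per_block] * num_blocks
    finish = [0] * num_blocks
    step = 0
    while any(r > 0 for r in remaining):
        step += 1
        # The sliding window: the first `window` unfinished blocks.
        active = [b for b in range(num_blocks) if remaining[b] > 0][:window]
        for b in active:  # active[0] fully-activated, the rest semi-activated
            remaining[b] -= 1
            if remaining[b] == 0:
                finish[b] = step
    return finish

# A wider window overlaps more blocks and finishes sooner (efficiency),
# at the cost of conditioning later blocks on less-settled prefixes (efficacy).
assert pipeline_decode(4, window=1) == [3, 6, 9, 12]  # sequential baseline
assert pipeline_decode(4, window=2) == [3, 3, 6, 6]   # pipelined
```

The window size is the knob behind the claimed efficiency/performance trade-off: window=1 degenerates to strict block-by-block AR decoding, while larger windows increase parallelism but let semi-activated blocks commit to predictions before their prefix has fully settled.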

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Discrete Diffusion Forcing (D2F) framework

Contribution

Asymmetric distillation strategy

Contribution

Pipelined parallel decoding algorithm

Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing | Novelty Validation