Fast-dLLM v2: Efficient Block-Diffusion LLM

ICLR 2026 Conference Submission · Anonymous Authors
Diffusion LLM · Efficient AI
Abstract:

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation—requiring only ∼1B tokens of fine-tuning. This represents a 500× reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model’s performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5× speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs—marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Fast-dLLM v2, a method for converting pretrained autoregressive models into block diffusion language models using approximately 1B tokens of fine-tuning. It resides in the 'Autoregressive-to-Diffusion Conversion and Adaptation' leaf, which contains four papers total including the original work. This leaf sits within the broader 'Block Diffusion Architecture and Training Methods' branch, indicating a moderately populated research direction focused specifically on efficient AR-to-diffusion conversion rather than training diffusion models from scratch.

The taxonomy reveals neighboring research directions including 'Novel Architecture Design for Block Diffusion' (five papers exploring fundamentally new architectures) and 'Variable-Length and Adaptive Block Generation' (four papers on dynamic block sizing). The paper's leaf is distinguished by its focus on knowledge inheritance from pretrained models rather than architectural novelty. Adjacent branches cover 'Inference Optimization and Acceleration Techniques' with specialized work on KV cache optimization and controllability, suggesting the paper bridges architectural adaptation with inference acceleration concerns through its hierarchical caching mechanism.

Across the three contributions analyzed (20 candidate papers examined in total), the data-efficient post-training strategy faces substantial prior work: 9 candidates examined, 6 potentially refutable. The hierarchical caching mechanism appears more novel, with only 1 candidate examined and none refutable. The speedup validation examined 10 candidates, 3 of them potentially refutable. This limited search scope suggests the conversion strategy operates in a crowded space alongside works like LLaDA and Next-Block Adaptation, while the specific caching design may represent a less explored technical direction within the broader field.

Based on this top-20 semantic search, the work appears to make incremental contributions to AR-to-diffusion conversion methodology, situated in a moderately active research area. The most distinctive element may be the hierarchical caching design, though the limited candidate pool prevents definitive assessment. The analysis does not cover exhaustive citation networks or recent unpublished work that might reveal additional overlaps.

Taxonomy

- Core-task Taxonomy Papers: 26
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 20
- Refutable Papers: 9

Research Landscape Overview

Core task: Accelerating large language model inference through block diffusion. The field has coalesced around the idea of replacing token-by-token autoregressive generation with block-level diffusion processes that predict multiple tokens simultaneously. The taxonomy reflects four main branches:

- Block Diffusion Architecture and Training Methods explores how to convert or adapt pretrained autoregressive models into diffusion frameworks, including techniques for initializing diffusion parameters and designing block-level objectives.
- Inference Optimization and Acceleration Techniques focuses on runtime strategies such as adaptive block sizing, efficient caching mechanisms (e.g., Attention KV Cache[7], dCache[15]), and variable-length generation schemes (Variable Generation Lengths[16]).
- Domain-Specific Applications and Extensions examines how block diffusion extends to multimodal settings (DiffusionVL[20], Audio-Language Joint[22]) and specialized tasks.
- Surveys and Comparative Studies (Diffusion LLM Survey[4]) provide overarching perspectives on the trade-offs between diffusion and autoregressive paradigms.

Within the architecture and training branch, a particularly active line of work addresses autoregressive-to-diffusion conversion and adaptation. Fast-dLLM v2[0] sits squarely in this cluster, proposing methods to efficiently transform existing autoregressive checkpoints into block diffusion models without full retraining. Nearby efforts such as LLaDA[10] and Next-Block Adaptation[11] similarly tackle the challenge of adapting pretrained weights to predict token blocks rather than single tokens, while Efficient-DLM[14] emphasizes computational efficiency during the conversion process. These works share a common goal of leveraging the vast investment in autoregressive pretraining while unlocking the parallelism benefits of diffusion inference.
The main open questions revolve around how much fine-tuning is necessary, whether certain architectural modifications (e.g., Block Transformer[12]) improve block-level coherence, and how to balance the speed gains from parallel decoding against potential quality degradation compared to standard autoregressive baselines.

Claimed Contributions

Data-efficient post-training strategy for adapting AR models to block-diffusion frameworks

The authors propose a method to convert pretrained autoregressive language models into block diffusion models using only approximately 1 billion tokens of fine-tuning, roughly 500 times less data than full-attention diffusion models such as Dream, which requires around 580 billion tokens. This is achieved through a novel training recipe combining block diffusion with complementary attention masking.

9 retrieved papers
Can Refute
Hierarchical caching mechanism with block-level and sub-block caches

The authors design a two-level caching system: a block-level cache that stores historical context representations across blocks, and a sub-block cache (DualCache) that enables efficient parallel generation within partially decoded blocks. This hierarchical approach substantially accelerates inference compared to prior diffusion methods.

1 retrieved paper
Comprehensive large-scale validation achieving 2.5× speedup over AR decoding

The authors perform extensive experiments on models up to 7 billion parameters across diverse benchmarks, demonstrating that their approach achieves up to 2.5 times faster inference than standard autoregressive decoding while maintaining generation quality comparable to strong autoregressive baselines.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Data-efficient post-training strategy for adapting AR models to block-diffusion frameworks

The authors propose a method to convert pretrained autoregressive language models into block diffusion models using only approximately 1 billion tokens of fine-tuning, roughly 500 times less data than full-attention diffusion models such as Dream, which requires around 580 billion tokens. This is achieved through a novel training recipe combining block diffusion with complementary attention masking.
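To make the attention pattern concrete, the following is an illustrative sketch (not the paper's code) of the kind of blockwise mask such a recipe implies: tokens attend bidirectionally within their own block while attending only causally to earlier blocks. The function name and `block_size` parameter are assumptions for illustration.

```python
# Hypothetical sketch of a blockwise attention mask: bidirectional inside a
# block, causal across blocks. Not the authors' implementation.
import numpy as np

def block_diffusion_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Return a boolean mask where mask[q, k] == True means query q may attend to key k."""
    block_id = np.arange(seq_len) // block_size
    # A query may attend to any key whose block index is <= its own:
    # all earlier (fully decoded) blocks plus its own block, bidirectionally.
    return block_id[:, None] >= block_id[None, :]

mask = block_diffusion_mask(seq_len=6, block_size=2)
# Token 0 can attend to token 1 (same block) but not to token 2 (later block).
```

Under this mask, standard causal attention is recovered as the special case `block_size=1`, which is one way such a conversion can stay close to the AR training objective.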

Contribution

Hierarchical caching mechanism with block-level and sub-block caches

The authors design a two-level caching system: a block-level cache that stores historical context representations across blocks, and a sub-block cache (DualCache) that enables efficient parallel generation within partially decoded blocks. This hierarchical approach substantially accelerates inference compared to prior diffusion methods.
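The two-level structure can be sketched as follows. This is a hedged illustration under assumed semantics, not the paper's API: completed blocks freeze their key/value tensors in a block-level cache, while the block currently being decoded keeps a provisional sub-block cache that is refreshed as masked positions are filled in. All names here are hypothetical.

```python
# Illustrative two-level KV cache: frozen entries for finished blocks plus a
# provisional entry for the partially decoded block. Not the paper's code.
import numpy as np

class HierarchicalKVCache:
    def __init__(self):
        self.block_cache = []   # K/V for fully decoded blocks (frozen)
        self.sub_cache = None   # K/V for the partially decoded current block

    def update_sub_block(self, kv: np.ndarray) -> None:
        # Overwrite the provisional cache on each parallel-decoding step.
        self.sub_cache = kv

    def commit_block(self) -> None:
        # Once every token in the block is decoded, freeze its K/V.
        self.block_cache.append(self.sub_cache)
        self.sub_cache = None

    def context(self) -> np.ndarray:
        # Full attention context: frozen blocks plus the current sub-block.
        parts = list(self.block_cache)
        if self.sub_cache is not None:
            parts.append(self.sub_cache)
        return np.concatenate(parts, axis=0)
```

The design point this sketch captures is that only the small sub-block entry is ever recomputed; historical blocks are read-only, which is what allows parallel steps inside a block without re-encoding the prefix.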

Contribution

Comprehensive large-scale validation achieving 2.5× speedup over AR decoding

The authors perform extensive experiments on models up to 7 billion parameters across diverse benchmarks, demonstrating that their approach achieves up to 2.5 times faster inference than standard autoregressive decoding while maintaining generation quality comparable to strong autoregressive baselines.
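A back-of-envelope model helps interpret why the realized speedup (2.5×) sits below the ideal k-fold gain of accepting k tokens per step. The overhead figure and tokens-per-step value below are illustrative assumptions, not measurements from the paper.

```python
# Toy arithmetic (not the paper's benchmark code): parallel decoding cuts the
# number of forward passes by ~k, but per-step overhead and conservative
# acceptance keep the realized speedup below k.
import math

def decoding_steps(n_tokens: int, tokens_per_step: int) -> int:
    # Number of forward passes needed to emit n_tokens.
    return math.ceil(n_tokens / tokens_per_step)

def realized_speedup(n_tokens: int, tokens_per_step: int,
                     per_step_overhead: float = 0.2) -> float:
    # Assume each parallel step costs (1 + per_step_overhead)x an AR step;
    # both numbers here are illustrative, not taken from the paper.
    ar_cost = decoding_steps(n_tokens, 1) * 1.0
    par_cost = decoding_steps(n_tokens, tokens_per_step) * (1.0 + per_step_overhead)
    return ar_cost / par_cost

# With ~3 tokens accepted per step and 20% per-step overhead on 128 tokens:
# 128 AR steps vs ceil(128/3) = 43 parallel steps at 1.2x cost, i.e. ~2.5x.
```

Under these assumed numbers the model lands near the reported 2.5× figure, which makes the claim plausible without requiring lossless k-fold parallelism.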
