Fast-dLLM v2: Efficient Block-Diffusion LLM

ICLR 2026 Conference Submission · Anonymous Authors
Diffusion LLM · Efficient AI
Abstract:

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation—requiring only ∼1B tokens of fine-tuning. This represents a 500× reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model’s performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5× speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs—marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Fast-dLLM v2, a method for converting pretrained autoregressive models into block diffusion language models using approximately 1B tokens of fine-tuning. It resides in the 'Autoregressive-to-Diffusion Conversion and Adaptation' leaf, which contains four papers total including the original work. This leaf sits within the broader 'Block Diffusion Architecture and Training Methods' branch, indicating a moderately populated research direction focused specifically on efficient AR-to-diffusion conversion rather than training diffusion models from scratch.

The taxonomy reveals neighboring research directions including 'Novel Architecture Design for Block Diffusion' (five papers exploring fundamentally new architectures) and 'Variable-Length and Adaptive Block Generation' (four papers on dynamic block sizing). The paper's leaf is distinguished by its focus on knowledge inheritance from pretrained models rather than architectural novelty. Adjacent branches cover 'Inference Optimization and Acceleration Techniques' with specialized work on KV cache optimization and controllability, suggesting the paper bridges architectural adaptation with inference acceleration concerns through its hierarchical caching mechanism.

Across the three contributions analyzed (20 candidate papers examined in total), the data-efficient post-training strategy faces substantial prior work: 9 candidates examined, 6 potentially refutable. The hierarchical caching mechanism appears more novel, with only 1 candidate examined and none refutable. The speedup validation examined 10 candidates, 3 of them potentially refutable. This limited search scope suggests the conversion strategy operates in a crowded space alongside works like LLaDA and Next-Block Adaptation, while the specific caching design may represent a less explored technical direction within the broader field.

Based on this top-20 semantic search, the work appears to make incremental contributions to AR-to-diffusion conversion methodology, situated in a moderately active research area. The most distinctive element may be the hierarchical caching design, though the limited candidate pool prevents definitive assessment. The analysis does not cover exhaustive citation networks or recent unpublished work that might reveal additional overlaps.

Taxonomy

- Core-task Taxonomy Papers: 26
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 20
- Refutable Papers: 9

Research Landscape Overview

Core task: Accelerating large language model inference through block diffusion. The field has coalesced around the idea of replacing token-by-token autoregressive generation with block-level diffusion processes that predict multiple tokens simultaneously. The taxonomy reflects four main branches:

- Block Diffusion Architecture and Training Methods explores how to convert or adapt pretrained autoregressive models into diffusion frameworks, including techniques for initializing diffusion parameters and designing block-level objectives.
- Inference Optimization and Acceleration Techniques focuses on runtime strategies such as adaptive block sizing, efficient caching mechanisms (e.g., Attention KV Cache[7], dCache[15]), and variable-length generation schemes (Variable Generation Lengths[16]).
- Domain-Specific Applications and Extensions examines how block diffusion extends to multimodal settings (DiffusionVL[20], Audio-Language Joint[22]) and specialized tasks.
- Surveys and Comparative Studies (Diffusion LLM Survey[4]) provide overarching perspectives on the trade-offs between diffusion and autoregressive paradigms.

Within the architecture and training branch, a particularly active line of work addresses autoregressive-to-diffusion conversion and adaptation. Fast-dLLM v2[0] sits squarely in this cluster, proposing methods to efficiently transform existing autoregressive checkpoints into block diffusion models without full retraining. Nearby efforts such as LLaDA[10] and Next-Block Adaptation[11] similarly tackle the challenge of adapting pretrained weights to predict token blocks rather than single tokens, while Efficient-DLM[14] emphasizes computational efficiency during the conversion process. These works share a common goal of leveraging the vast investment in autoregressive pretraining while unlocking the parallelism benefits of diffusion inference.
The main open questions revolve around how much fine-tuning is necessary, whether certain architectural modifications (e.g., Block Transformer[12]) improve block-level coherence, and how to balance the speed gains from parallel decoding against potential quality degradation compared to standard autoregressive baselines.

Claimed Contributions

Data-efficient post-training strategy for adapting AR models to block-diffusion frameworks

The authors propose a method to convert pretrained autoregressive language models into block diffusion models using only approximately 1 billion tokens of fine-tuning, roughly 500 times less data than full-attention diffusion models such as Dream, which requires around 580 billion tokens. This is achieved through a novel training recipe combining block diffusion with complementary attention masking.

9 retrieved papers
Can Refute
Hierarchical caching mechanism with block-level and sub-block caches

The authors design a two-level caching system: a block-level cache that stores historical context representations across blocks, and a sub-block cache (DualCache) that enables efficient parallel generation within partially decoded blocks. This hierarchical approach substantially accelerates inference compared to prior diffusion methods.

1 retrieved paper
Comprehensive large-scale validation achieving 2.5× speedup over AR decoding

The authors perform extensive experiments on models up to 7 billion parameters across diverse benchmarks, demonstrating that their approach achieves up to 2.5 times faster inference than standard autoregressive decoding while maintaining generation quality comparable to strong autoregressive baselines.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Data-efficient post-training strategy for adapting AR models to block-diffusion frameworks

The authors propose a method to convert pretrained autoregressive language models into block diffusion models using only approximately 1 billion tokens of fine-tuning, roughly 500 times less data than full-attention diffusion models such as Dream, which requires around 580 billion tokens. This is achieved through a novel training recipe combining block diffusion with complementary attention masking.
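To make the attention pattern concrete, the following is an illustrative sketch (not the paper's code) of the kind of blockwise mask such a recipe implies: tokens attend bidirectionally within their own block while attending only causally to earlier blocks. The function name and `block_size` parameter are assumptions for illustration.

```python
# Hypothetical sketch of a blockwise attention mask: bidirectional inside a
# block, causal across blocks. Not the authors' implementation.
import numpy as np

def block_diffusion_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Return a boolean mask where mask[q, k] == True means query q may attend to key k."""
    block_id = np.arange(seq_len) // block_size
    # A query may attend to any key whose block index is <= its own:
    # all earlier (fully decoded) blocks plus its own block, bidirectionally.
    return block_id[:, None] >= block_id[None, :]

mask = block_diffusion_mask(seq_len=6, block_size=2)
# Token 0 can attend to token 1 (same block) but not to token 2 (later block).
```

Under this mask, standard causal attention is recovered as the special case `block_size=1`, which is one way such a conversion can stay close to the AR training objective.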

Contribution

Hierarchical caching mechanism with block-level and sub-block caches

The authors design a two-level caching system: a block-level cache that stores historical context representations across blocks, and a sub-block cache (DualCache) that enables efficient parallel generation within partially decoded blocks. This hierarchical approach substantially accelerates inference compared to prior diffusion methods.
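The two-level structure can be sketched as follows. This is a hedged illustration under assumed semantics, not the paper's API: completed blocks freeze their key/value tensors in a block-level cache, while the block currently being decoded keeps a provisional sub-block cache that is refreshed as masked positions are filled in. All names here are hypothetical.

```python
# Illustrative two-level KV cache: frozen entries for finished blocks plus a
# provisional entry for the partially decoded block. Not the paper's code.
import numpy as np

class HierarchicalKVCache:
    def __init__(self):
        self.block_cache = []   # K/V for fully decoded blocks (frozen)
        self.sub_cache = None   # K/V for the partially decoded current block

    def update_sub_block(self, kv: np.ndarray) -> None:
        # Overwrite the provisional cache on each parallel-decoding step.
        self.sub_cache = kv

    def commit_block(self) -> None:
        # Once every token in the block is decoded, freeze its K/V.
        self.block_cache.append(self.sub_cache)
        self.sub_cache = None

    def context(self) -> np.ndarray:
        # Full attention context: frozen blocks plus the current sub-block.
        parts = list(self.block_cache)
        if self.sub_cache is not None:
            parts.append(self.sub_cache)
        return np.concatenate(parts, axis=0)
```

The design point this sketch captures is that only the small sub-block entry is ever recomputed; historical blocks are read-only, which is what allows parallel steps inside a block without re-encoding the prefix.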

Contribution

Comprehensive large-scale validation achieving 2.5× speedup over AR decoding

The authors perform extensive experiments on models up to 7 billion parameters across diverse benchmarks, demonstrating that their approach achieves up to 2.5 times faster inference than standard autoregressive decoding while maintaining generation quality comparable to strong autoregressive baselines.
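A back-of-envelope model helps interpret why the realized speedup (2.5×) sits below the ideal k-fold gain of accepting k tokens per step. The overhead figure and tokens-per-step value below are illustrative assumptions, not measurements from the paper.

```python
# Toy arithmetic (not the paper's benchmark code): parallel decoding cuts the
# number of forward passes by ~k, but per-step overhead and conservative
# acceptance keep the realized speedup below k.
import math

def decoding_steps(n_tokens: int, tokens_per_step: int) -> int:
    # Number of forward passes needed to emit n_tokens.
    return math.ceil(n_tokens / tokens_per_step)

def realized_speedup(n_tokens: int, tokens_per_step: int,
                     per_step_overhead: float = 0.2) -> float:
    # Assume each parallel step costs (1 + per_step_overhead)x an AR step;
    # both numbers here are illustrative, not taken from the paper.
    ar_cost = decoding_steps(n_tokens, 1) * 1.0
    par_cost = decoding_steps(n_tokens, tokens_per_step) * (1.0 + per_step_overhead)
    return ar_cost / par_cost

# With ~3 tokens accepted per step and 20% per-step overhead on 128 tokens:
# 128 AR steps vs ceil(128/3) = 43 parallel steps at 1.2x cost, i.e. ~2.5x.
```

Under these assumed numbers the model lands near the reported 2.5× figure, which makes the claim plausible without requiring lossless k-fold parallelism.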
