Fast Language Generation through Discrete Diffusion Divergence Instruct

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: discrete diffusion models, masked diffusion models, distillation, integral KL divergence, large language models, generative modeling
Abstract:

Fast, high-quality language generation is a long-standing goal in the age of AI. In this work, we introduce Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that initializes from a pre-trained (masked) discrete diffusion language model (dLLM) and distills a few-step student for fast generation. The resulting DiDi-Instruct model matches or surpasses its dLLM teacher and the GPT-2 baseline while enabling up to 64× acceleration. The theoretical foundation of DiDi-Instruct is a novel framework based on integral KL-divergence minimization, which yields a practical training algorithm. We further introduce grouped reward normalization, intermediate-state matching, and a reward-guided ancestral sampler, which significantly improve training stability, model coverage, and inference quality. On OpenWebText, DiDi-Instruct achieves perplexities from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline. These gains come with a negligible entropy loss (around 11%) and reduce additional training wall-clock time by more than 20× compared to competing dLLM distillation methods. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, downstream tasks, and the generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method, enabling language generation in the blink of an eye. We will release our code and models along with the paper.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 1

Research Landscape Overview

Core task: accelerating discrete diffusion language models through distillation. The field has evolved into several distinct branches that reflect both methodological diversity and domain specialization. Discrete Diffusion Distillation Methods focus on adapting distillation techniques specifically for language and discrete token spaces, often leveraging KL-divergence objectives or specialized sampling strategies tailored to categorical distributions. Image Diffusion Distillation Methods encompass a rich body of work on accelerating continuous diffusion models for visual generation, exploring progressive distillation schemes like Progressive Distillation[7], score-matching approaches such as Score Identity Distillation[3], and one-step or few-step generators including Swiftbrush[1] and Imagine Flash[16]. Cross-Domain and Unified Distillation Frameworks attempt to bridge discrete and continuous settings, proposing architectures or training paradigms that generalize across modalities. Domain-Specific Distillation Applications address tailored challenges in areas like video, audio, or 3D generation, while Survey and Overview papers such as Diffusion Acceleration Survey[21] synthesize emerging trends and open questions across these branches. Within the discrete distillation landscape, a central tension revolves around balancing sample quality, inference speed, and training stability when moving from continuous to categorical spaces. Diffusion Divergence Instruct[0] sits squarely in the KL-Divergence Based Distillation cluster, emphasizing divergence minimization to compress multi-step discrete diffusion into fewer steps for language modeling. Its closest neighbor, Ultra-fast Divergence Instruct[4], shares this KL-centric philosophy but pushes toward even more aggressive step reduction. 
In contrast, works like Discrete Diffusion Forcing[8] and Absorbing Discrete Diffusion[20] explore alternative parameterizations or absorbing-state dynamics that sidestep some gradient estimation challenges inherent in categorical distillation. Meanwhile, methods such as DKDM[6] and Diffusion Duality[5] investigate dual formulations or knowledge transfer mechanisms that complement divergence-based objectives. The original paper thus contributes to an active subfield where researchers are refining how to faithfully distill discrete generative processes without sacrificing the expressiveness that makes diffusion models attractive for language generation.

Claimed Contributions

DiDi-Instruct: a training-based distillation method for fast language generation

The authors propose DiDi-Instruct, a novel distillation framework that trains a few-step student model from a pre-trained masked discrete diffusion language model. This method achieves comparable or superior performance to the teacher model while enabling up to 64× acceleration in generation speed.
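To make the few-step setting concrete: a masked dLLM begins from a fully masked sequence, and a distilled student reveals a large fraction of the remaining masked positions at each step, so that 8 to 128 NFEs can suffice where the teacher needs many more. The sketch below is a generic illustration of few-step ancestral unmasking under a stand-in uniform model, not the authors' sampler; the names `MASK`, `few_step_unmask`, and the placeholder `model` interface are our own assumptions.

```python
import numpy as np

MASK = -1  # placeholder id for a masked position

def few_step_unmask(model, length, steps, vocab, rng):
    """Few-step ancestral unmasking: start fully masked and reveal
    roughly 1/steps of the remaining masked positions per step,
    sampling each revealed token from the model's proposal distribution."""
    x = np.full(length, MASK)
    for s in range(steps, 0, -1):
        masked = np.flatnonzero(x == MASK)
        n = int(np.ceil(len(masked) / s))           # how many to reveal now
        pick = rng.choice(masked, size=n, replace=False)
        probs = model(x)                             # (length, vocab) proposals
        for i in pick:
            x[i] = rng.choice(vocab, p=probs[i])
    return x

# Stand-in model: uniform proposals over a 4-token vocabulary.
uniform_model = lambda x: np.full((8, 4), 0.25)
sample = few_step_unmask(uniform_model, length=8, steps=2,
                         vocab=np.arange(4), rng=np.random.default_rng(0))
```

With `steps=2`, half the sequence is revealed per step; a teacher sampler would typically spend far more NFEs to cover the same positions, which is where the claimed acceleration comes from.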

9 retrieved papers (can refute)
Theoretical framework based on integral KL-divergence minimization

The authors develop a principled training method grounded in minimizing integral KL-divergence between student and teacher distributions. They reformulate the distillation objective using a policy gradient perspective, deriving a tractable update rule that uses an adversarial discriminator to estimate log-density ratios.
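The density-ratio mechanics can be sketched in a few lines. This is a hedged illustration under standard GAN-style assumptions, not the paper's implementation: if a discriminator is trained to separate teacher samples from student samples, its logit at optimality equals log p_teacher(x) − log p_student(x), which can serve as the reward in a REINFORCE-style surrogate. The function name `didi_surrogate_loss` and the mean-baseline subtraction are our own illustrative choices.

```python
import numpy as np

def didi_surrogate_loss(student_logprobs, disc_logits):
    """Policy-gradient surrogate for divergence-based distillation:
    minimize E_{x ~ student}[ -r(x) * log p_student(x) ], where the reward
    r(x) is the discriminator logit, i.e. an estimate of the log-density
    ratio log p_teacher(x) - log p_student(x)."""
    r = np.asarray(disc_logits, dtype=float)
    r = r - r.mean()                      # baseline subtraction reduces variance
    lp = np.asarray(student_logprobs, dtype=float)
    return float(-(r * lp).mean())
```

Sequences the discriminator scores as more teacher-like receive positive reward, so the surrogate pushes up their student log-probability; teacher-unlike sequences are pushed down.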

8 retrieved papers
Training and inference techniques: grouped reward normalization, intermediate-state matching, and reward-guided ancestral sampler

The authors introduce three key techniques to enhance the distillation process: grouped reward normalization for training stability, intermediate-state matching to prevent mode collapse, and a reward-guided ancestral sampler (RGAS) that improves inference quality through gradient tilting and candidate re-ranking.
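Two of these techniques lend themselves to short sketches. The snippet below is a hedged illustration, not the paper's code: `grouped_reward_norm` normalizes rewards within groups of samples drawn from the same state (in the spirit of group-relative baselines), and `rerank_candidates` shows the re-ranking half of reward-guided sampling by keeping the top-k candidates under the reward; both function names and the `1e-8` stabilizer are our own choices, and the gradient-tilting half of RGAS is omitted.

```python
import numpy as np

def grouped_reward_norm(rewards, group_size):
    """Normalize rewards to zero mean and unit scale within each group of
    samples from the same prompt/state, stabilizing the update magnitude."""
    r = np.asarray(rewards, dtype=float).reshape(-1, group_size)
    r = (r - r.mean(axis=1, keepdims=True)) / (r.std(axis=1, keepdims=True) + 1e-8)
    return r.ravel()

def rerank_candidates(candidates, rewards, k=1):
    """Reward-guided re-ranking: keep the k highest-reward candidates."""
    order = np.argsort(rewards)[::-1]
    return [candidates[i] for i in order[:k]]
```

Group normalization keeps the reward scale comparable across prompts of very different difficulty, while re-ranking trades a constant factor of extra samples at inference time for higher-quality outputs.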

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DiDi-Instruct: a training-based distillation method for fast language generation

Contribution

Theoretical framework based on integral KL-divergence minimization

Contribution

Training and inference techniques: grouped reward normalization, intermediate-state matching, and reward-guided ancestral sampler
