Fast Language Generation through Discrete Diffusion Divergence Instruct

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: discrete diffusion models, masked diffusion models, distillation, integral KL divergence, large language models, generative modeling
Abstract:

Fast, high-quality language generation is a long-standing goal in the age of AI. In this work, we introduce Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that initializes from a pre-trained (masked) discrete diffusion language model (dLLM) and distills a few-step student for fast generation. The resulting DiDi-Instruct model matches or surpasses its dLLM teacher and the GPT-2 baseline while enabling up to 64× acceleration. The theoretical foundation of DiDi-Instruct is a novel framework based on integral KL-divergence minimization, which yields a practical training algorithm. We further introduce grouped reward normalization, intermediate-state matching, and a reward-guided ancestral sampler, which significantly improve training stability, model coverage, and inference quality. On OpenWebText, DiDi-Instruct achieves perplexities from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline. These gains come with a negligible entropy loss (around 11%) and reduce additional training wall-clock time by more than 20× compared to competing dLLM distillation methods. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, downstream tasks, and the generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method, enabling language generation in the blink of an eye. We will release our code and models along with the paper.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 1

Research Landscape Overview

Core task: accelerating discrete diffusion language models through distillation. The field has evolved into several distinct branches that reflect both methodological diversity and domain specialization. Discrete Diffusion Distillation Methods focus on adapting distillation techniques specifically for language and discrete token spaces, often leveraging KL-divergence objectives or specialized sampling strategies tailored to categorical distributions. Image Diffusion Distillation Methods encompass a rich body of work on accelerating continuous diffusion models for visual generation, exploring progressive distillation schemes like Progressive Distillation[7], score-matching approaches such as Score Identity Distillation[3], and one-step or few-step generators including Swiftbrush[1] and Imagine Flash[16]. Cross-Domain and Unified Distillation Frameworks attempt to bridge discrete and continuous settings, proposing architectures or training paradigms that generalize across modalities. Domain-Specific Distillation Applications address tailored challenges in areas like video, audio, or 3D generation, while Survey and Overview papers such as Diffusion Acceleration Survey[21] synthesize emerging trends and open questions across these branches. Within the discrete distillation landscape, a central tension revolves around balancing sample quality, inference speed, and training stability when moving from continuous to categorical spaces. Diffusion Divergence Instruct[0] sits squarely in the KL-Divergence Based Distillation cluster, emphasizing divergence minimization to compress multi-step discrete diffusion into fewer steps for language modeling. Its closest neighbor, Ultra-fast Divergence Instruct[4], shares this KL-centric philosophy but pushes toward even more aggressive step reduction. 
In contrast, works like Discrete Diffusion Forcing[8] and Absorbing Discrete Diffusion[20] explore alternative parameterizations or absorbing-state dynamics that sidestep some gradient estimation challenges inherent in categorical distillation. Meanwhile, methods such as DKDM[6] and Diffusion Duality[5] investigate dual formulations or knowledge transfer mechanisms that complement divergence-based objectives. The original paper thus contributes to an active subfield where researchers are refining how to faithfully distill discrete generative processes without sacrificing the expressiveness that makes diffusion models attractive for language generation.

Claimed Contributions

DiDi-Instruct: a training-based distillation method for fast language generation

The authors propose DiDi-Instruct, a novel distillation framework that trains a few-step student model from a pre-trained masked discrete diffusion language model. This method achieves comparable or superior performance to the teacher model while enabling up to 64× acceleration in generation speed.
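To make the few-step setting concrete: a masked dLLM begins from a fully masked sequence, and a distilled student reveals a large fraction of the remaining masked positions at each step, so that 8 to 128 NFEs can suffice where the teacher needs many more. The sketch below is a generic illustration of few-step ancestral unmasking under a stand-in uniform model, not the authors' sampler; the names `MASK`, `few_step_unmask`, and the placeholder `model` interface are our own assumptions.

```python
import numpy as np

MASK = -1  # placeholder id for a masked position

def few_step_unmask(model, length, steps, vocab, rng):
    """Few-step ancestral unmasking: start fully masked and reveal
    roughly 1/steps of the remaining masked positions per step,
    sampling each revealed token from the model's proposal distribution."""
    x = np.full(length, MASK)
    for s in range(steps, 0, -1):
        masked = np.flatnonzero(x == MASK)
        n = int(np.ceil(len(masked) / s))           # how many to reveal now
        pick = rng.choice(masked, size=n, replace=False)
        probs = model(x)                             # (length, vocab) proposals
        for i in pick:
            x[i] = rng.choice(vocab, p=probs[i])
    return x

# Stand-in model: uniform proposals over a 4-token vocabulary.
uniform_model = lambda x: np.full((8, 4), 0.25)
sample = few_step_unmask(uniform_model, length=8, steps=2,
                         vocab=np.arange(4), rng=np.random.default_rng(0))
```

With `steps=2`, half the sequence is revealed per step; a teacher sampler would typically spend far more NFEs to cover the same positions, which is where the claimed acceleration comes from.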

9 retrieved papers (can refute)
Theoretical framework based on integral KL-divergence minimization

The authors develop a principled training method grounded in minimizing integral KL-divergence between student and teacher distributions. They reformulate the distillation objective using a policy gradient perspective, deriving a tractable update rule that uses an adversarial discriminator to estimate log-density ratios.
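The density-ratio mechanics can be sketched in a few lines. This is a hedged illustration under standard GAN-style assumptions, not the paper's implementation: if a discriminator is trained to separate teacher samples from student samples, its logit at optimality equals log p_teacher(x) − log p_student(x), which can serve as the reward in a REINFORCE-style surrogate. The function name `didi_surrogate_loss` and the mean-baseline subtraction are our own illustrative choices.

```python
import numpy as np

def didi_surrogate_loss(student_logprobs, disc_logits):
    """Policy-gradient surrogate for divergence-based distillation:
    minimize E_{x ~ student}[ -r(x) * log p_student(x) ], where the reward
    r(x) is the discriminator logit, i.e. an estimate of the log-density
    ratio log p_teacher(x) - log p_student(x)."""
    r = np.asarray(disc_logits, dtype=float)
    r = r - r.mean()                      # baseline subtraction reduces variance
    lp = np.asarray(student_logprobs, dtype=float)
    return float(-(r * lp).mean())
```

Sequences the discriminator scores as more teacher-like receive positive reward, so the surrogate pushes up their student log-probability; teacher-unlike sequences are pushed down.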

8 retrieved papers
Training and inference techniques: grouped reward normalization, intermediate-state matching, and reward-guided ancestral sampler

The authors introduce three key techniques to enhance the distillation process: grouped reward normalization for training stability, intermediate-state matching to prevent mode collapse, and a reward-guided ancestral sampler (RGAS) that improves inference quality through gradient tilting and candidate re-ranking.
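Two of these techniques lend themselves to short sketches. The snippet below is a hedged illustration, not the paper's code: `grouped_reward_norm` normalizes rewards within groups of samples drawn from the same state (in the spirit of group-relative baselines), and `rerank_candidates` shows the re-ranking half of reward-guided sampling by keeping the top-k candidates under the reward; both function names and the `1e-8` stabilizer are our own choices, and the gradient-tilting half of RGAS is omitted.

```python
import numpy as np

def grouped_reward_norm(rewards, group_size):
    """Normalize rewards to zero mean and unit scale within each group of
    samples from the same prompt/state, stabilizing the update magnitude."""
    r = np.asarray(rewards, dtype=float).reshape(-1, group_size)
    r = (r - r.mean(axis=1, keepdims=True)) / (r.std(axis=1, keepdims=True) + 1e-8)
    return r.ravel()

def rerank_candidates(candidates, rewards, k=1):
    """Reward-guided re-ranking: keep the k highest-reward candidates."""
    order = np.argsort(rewards)[::-1]
    return [candidates[i] for i in order[:k]]
```

Group normalization keeps the reward scale comparable across prompts of very different difficulty, while re-ranking trades a constant factor of extra samples at inference time for higher-quality outputs.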

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DiDi-Instruct: a training-based distillation method for fast language generation

Contribution

Theoretical framework based on integral KL-divergence minimization

Contribution

Training and inference techniques: grouped reward normalization, intermediate-state matching, and reward-guided ancestral sampler
