Fast Language Generation through Discrete Diffusion Divergence Instruct
Research Landscape Overview
Claimed Contributions
The authors propose DiDi-Instruct, a novel distillation framework that trains a few-step student model from a pre-trained masked discrete diffusion language model. The student matches or surpasses the teacher's performance while generating up to 64× faster.
The authors develop a principled training method grounded in minimizing the integral KL divergence between the student and teacher distributions. They reformulate the distillation objective from a policy-gradient perspective, deriving a tractable update rule in which an adversarial discriminator estimates the log-density ratio between the two distributions.
The authors introduce three key techniques to enhance the distillation process: grouped reward normalization for training stability, intermediate-state matching to prevent mode collapse, and a reward-guided ancestral sampler (RGAS) that improves inference quality through gradient tilting and candidate re-ranking.
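The claimed speedup comes from collapsing many ancestral-sampling steps into a few: each step costs one forward pass of the denoiser, so an 8-step student standing in for a 512-step teacher needs 64× fewer network calls. A toy sketch of this mechanism (the `denoise_fn` interface and the linear unmasking schedule are illustrative assumptions, not the paper's implementation):

```python
import random

MASK = -1  # sentinel for a masked position

def few_step_sample(denoise_fn, seq_len, num_steps, seed=0):
    """Toy ancestral sampler for a masked discrete diffusion model.

    `denoise_fn(tokens)` returns a predicted token for every position;
    each call stands in for one (expensive) forward pass of the network,
    so total generation cost scales linearly with `num_steps`.
    """
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(num_steps):
        preds = denoise_fn(tokens)             # one network call per step
        t_next = 1.0 - (step + 1) / num_steps  # mask fraction after this step
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        keep_masked = int(round(t_next * seq_len))
        for i in rng.sample(masked, len(masked) - keep_masked):
            tokens[i] = preds[i]               # commit a subset of predictions
    return tokens
```

With `num_steps` set to the teacher's step count this reduces to ordinary ancestral sampling; the distilled student runs the same loop with far fewer iterations.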
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] Ultra-fast Language Generation via Discrete Diffusion Divergence Instruct
Contribution Analysis
Detailed comparisons for each claimed contribution
DiDi-Instruct: a training-based distillation method for fast language generation
The authors propose DiDi-Instruct, a novel distillation framework that trains a few-step student model from a pre-trained masked discrete diffusion language model. The student matches or surpasses the teacher's performance while generating up to 64× faster.
[17] Learnable Sampler Distillation for Discrete Diffusion Models
[2] Distillation of Discrete Diffusion through Dimensional Correlations
[3] Score Identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation
[4] Ultra-fast Language Generation via Discrete Diffusion Divergence Instruct
[5] The Diffusion Duality
[8] Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
[14] Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
[38] Compressed and Smooth Latent Space for Text Diffusion Modeling
[39] Inference-Time Diffusion Model Distillation
Theoretical framework based on integral KL-divergence minimization
The authors develop a principled training method grounded in minimizing the integral KL divergence between the student and teacher distributions. They reformulate the distillation objective from a policy-gradient perspective, deriving a tractable update rule in which an adversarial discriminator estimates the log-density ratio between the two distributions.
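The density-ratio trick follows the standard GAN-style argument: a discriminator trained to separate teacher samples from student samples has, at optimality, D(x) = p_teacher(x) / (p_teacher(x) + p_student(x)), so its logit recovers log p_teacher(x) − log p_student(x), which can serve as a per-sample reward in a REINFORCE-style update. A minimal numeric sketch (function names are illustrative, not the authors' code):

```python
import math

def log_density_ratio(d_out):
    """Logit of a discriminator output D(x) in (0, 1).

    For an optimal discriminator D = p_teacher / (p_teacher + p_student),
    the logit equals log p_teacher(x) - log p_student(x).
    """
    return math.log(d_out) - math.log(1.0 - d_out)

def policy_gradient_weights(d_outs):
    """REINFORCE-style weights: the gradient of each sample's student
    log-probability is scaled by its estimated log-density-ratio reward."""
    return [log_density_ratio(d) for d in d_outs]
```

When the discriminator cannot tell the two distributions apart (D = 0.5), the reward vanishes and the student stops updating, which is exactly the KL-minimization fixed point.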
[29] KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning
[30] Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation
[31] Droid: Learning from Offline Heterogeneous Demonstrations via Reward-Policy Distillation
[32] Towards Searching for the Best Student in a Knowledge Distillation Framework
[33] ADPO: Anchored Direct Preference Optimization
[34] On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
[35] A Distributional Approach to Controlled Text Generation
[36] Distillation and Generalization in Deep Reinforcement Learning
Training and inference techniques: grouped reward normalization, intermediate-state matching, and reward-guided ancestral sampler
The authors introduce three key techniques to enhance the distillation process: grouped reward normalization for training stability, intermediate-state matching to prevent mode collapse, and a reward-guided ancestral sampler (RGAS) that improves inference quality through gradient tilting and candidate re-ranking.
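Two of these techniques are easy to sketch. Grouped reward normalization standardizes rewards within a group of samples drawn for the same state, making updates invariant to the reward's scale and offset; candidate re-ranking, the final stage of RGAS, keeps the highest-reward sequence among several sampled candidates. A minimal sketch (the reward interface and grouping scheme are illustrative assumptions):

```python
import math

def normalize_group(rewards, eps=1e-8):
    """Standardize rewards within one group: subtract the group mean and
    divide by the group standard deviation (eps guards degenerate groups)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def rerank(candidates, reward_fn):
    """Candidate re-ranking at inference: sample several sequences and
    keep the one the reward model scores highest."""
    return max(candidates, key=reward_fn)
```

Because only relative rewards within a group matter after normalization, a poorly calibrated discriminator still produces a usable learning signal, which is the stated stability benefit.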