Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models
Claimed Contributions
The authors provide the first theoretical decomposition of training variance in masked diffusion models into three distinct sources: masking pattern noise from randomness in token masking, masking rate noise from variability across different masking rates, and data noise from inherent sample difficulty. This framework explains why MDMs exhibit fundamentally higher training variance than autoregressive models.
The authors derive the masking-rate sampler that is theoretically optimal for minimizing training variance and implement it as P-POTS, a parametric method. P-POTS samples high-variance masking rates more frequently and applies importance weighting to keep training unbiased and avoid destabilizing updates, achieving Pareto optimality among all unbiased samplers.
The authors propose MIRROR, a method that constructs two complementary masked samples from the same input and masking rate, exploiting their negative correlation to explicitly reduce masking pattern noise. This technique reduces variance by at least half compared to standard training.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[15] Information-Theoretic Discrete Diffusion
Contribution Analysis
Detailed comparisons for each claimed contribution
Systematic variance decomposition for masked diffusion model training
The authors provide the first theoretical decomposition of training variance in masked diffusion models into three distinct sources: masking pattern noise from randomness in token masking, masking rate noise from variability across different masking rates, and data noise from inherent sample difficulty. This framework explains why MDMs exhibit fundamentally higher training variance than autoregressive models.
[2] DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
[3] LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
[4] The diffusion duality
[9] Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models
[15] Information-Theoretic Discrete Diffusion
[29] Variational diffusion models
[30] Improving reasoning for diffusion language models via group diffusion policy optimization
[31] MC-DiT: Contextual Enhancement via Clean-to-Clean Reconstruction for Masked Diffusion Models
[32] Di[M]O: Distilling masked diffusion models into one-step generator
[33] Hyperspherical Latents Improve Continuous-Token Autoregressive Generation
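The claimed three-way decomposition follows the law of total variance applied twice, conditioning first on the data sample, then on the masking rate, and last on the mask draw. The toy sketch below illustrates the arithmetic with a made-up scalar loss (the `toy_loss` function and all sample sizes are illustrative assumptions, not the paper's loss); the three components sum exactly to the total variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scalar stand-in for the per-example MDM training loss; the real loss
# is the model's weighted cross-entropy over masked positions. The three
# arguments mirror the paper's three noise sources: the data sample x,
# the masking rate t, and the concrete mask draw.
def toy_loss(x, t, mask):
    return x * t + (1.0 - t) * mask.mean()

n_data, n_rates, n_masks, seq_len = 100, 40, 40, 16
data = rng.normal(size=n_data)                 # data noise source
rates = rng.uniform(0.05, 0.95, size=n_rates)  # masking rate noise source

losses = np.empty((n_data, n_rates, n_masks))
for i, x in enumerate(data):
    for j, t in enumerate(rates):
        for k in range(n_masks):
            mask = rng.random(seq_len) < t     # masking pattern noise source
            losses[i, j, k] = toy_loss(x, t, mask)

# Law of total variance, conditioning innermost-first:
pattern_var = losses.var(axis=2).mean()            # E[Var over mask draws]
rate_var = losses.mean(axis=2).var(axis=1).mean()  # E[Var over rates]
data_var = losses.mean(axis=(1, 2)).var()          # Var over data samples
total_var = losses.var()
# total_var equals pattern_var + rate_var + data_var up to float rounding.
```

Comparing the relative size of the three terms on a real loss is what motivates targeting the masking-pattern and masking-rate components separately.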
P-POTS: Pareto-optimal timestep sampling method
The authors derive the masking-rate sampler that is theoretically optimal for minimizing training variance and implement it as P-POTS, a parametric method. P-POTS samples high-variance masking rates more frequently and applies importance weighting to keep training unbiased and avoid destabilizing updates, achieving Pareto optimality among all unbiased samplers.
[41] Adaptive Non-uniform Timestep Sampling for Accelerating Diffusion Model Training
[48] Adaptive Non-Uniform Timestep Sampling for Diffusion Model Training
[29] Variational diffusion models
[42] Evaluating the design space of diffusion-based generative models
[43] Timestep-aware correction for quantized diffusion models
[44] Common diffusion noise schedules and sample steps are flawed
[45] Efficient Diffusion Training via Min-SNR Weighting Strategy
[46] Adaptive time-stepping schedules for diffusion models
[47] A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training
[49] Bidirectional Beta-Tuned Diffusion Model
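The mechanism behind samplers of this kind is classical importance sampling: draw rates from a non-uniform proposal and reweight by the ratio of target to proposal probability so the expected objective is unchanged. The sketch below is a minimal illustration of that idea, not P-POTS itself; the rate grid, the made-up second-moment statistics, and the q proportional to the root second moment rule are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discretized masking rates and an assumed running estimate of the
# per-rate second moment of the loss (numbers made up for illustration).
t_grid = np.linspace(0.05, 0.95, 19)
second_moment = 0.2 + 4.0 * t_grid**2

# Variance-minimizing unbiased proposal for this setup: sample each rate
# in proportion to the square root of its second moment, then weight by
# p/q so the expected training objective is unchanged.
q = np.sqrt(second_moment)
q /= q.sum()
p_uniform = np.full_like(q, 1.0 / q.size)

def sample_masking_rate():
    """Draw a rate from q with the importance weight that keeps it unbiased."""
    j = rng.choice(t_grid.size, p=q)
    return t_grid[j], p_uniform[j] / q[j]

# Unbiasedness holds exactly on the grid: E_q[w * f(t)] == E_p[f(t)].
f = second_moment                  # any per-rate statistic works here
lhs = np.sum(q * (p_uniform / q) * f)
rhs = np.sum(p_uniform * f)
```

Checking the weighted second moments of `f` under both distributions confirms the variance reduction relative to uniform sampling while the mean stays fixed.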
MIRROR: Complementary masking for variance reduction
The authors propose MIRROR, a method that constructs two complementary masked samples from the same input and masking rate, exploiting their negative correlation to explicitly reduce masking pattern noise. This technique reduces variance by at least half compared to standard training.
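MIRROR's pairing can be read as an antithetic-sampling construction: if the two losses are negatively correlated, averaging them gives Var((f1 + f2) / 2) = Var(f)(1 + corr) / 2, i.e. at least the claimed factor-of-two reduction whenever the correlation is non-positive. The sketch below is one way to build such a pair (an assumption about the mechanism, not the paper's exact recipe): reuse a single uniform draw per position so that positions masked in one sample tend to be visible in the other.

```python
import numpy as np

rng = np.random.default_rng(2)

def mirrored_masks(seq_len, t, rng):
    """Two masks with the same marginal rate t, built from one uniform draw.

    m1 masks position i when u[i] < t; m2 masks it when u[i] > 1 - t. For
    t <= 0.5 the two events are disjoint, so the masks are negatively
    correlated (a sketch of MIRROR's complementary pair, not its exact form).
    """
    u = rng.random(seq_len)
    return u < t, u > 1.0 - t

# Empirically check: both masks hit the target rate on average, while the
# pair never overlaps at t = 0.3 (independent masks would overlap ~9% of
# positions), which is what drives the variance reduction.
t, seq_len, n_draws = 0.3, 64, 20_000
rates1, rates2, overlap = [], [], []
for _ in range(n_draws):
    m1, m2 = mirrored_masks(seq_len, t, rng)
    rates1.append(m1.mean())
    rates2.append(m2.mean())
    overlap.append((m1 & m2).mean())
```

Both samples are valid draws from the same masking distribution, so the pair can replace two independent draws in the training loop at no cost in bias.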