Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models

ICLR 2026 Conference Submission
Anonymous Authors
Masked Diffusion Models, Training Variance, Training Stability, Mask Schedule, Mask Sampling
Abstract:

Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from inherently higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so MDMs and ARMs that are equally strong at initialization often diverge after task-specific training, with MDMs falling far behind. To date there has been no theoretical explanation for this gap, nor a systematic solution. In this paper, we derive the first decomposition of MDM training variance into three sources -- {A} masking pattern noise, {B} masking rate noise, and {C} data noise -- while ARMs are affected only by {C}. This cleanly explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a Pareto-optimal t-sampler that minimizes training variance by sampling harder t values more often while taking appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated sample pairs to reduce {A}. Experiments show that, compared to standard MDM training, our methods improve accuracy by 7–8% on complex reasoning tasks while reducing run-to-run variability to near ARM levels, substantially narrowing the gap with strong ARM baselines; in most settings, even the best runs of baseline methods fall below the worst run of our method.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 3

Research Landscape Overview

Core task: Reducing training variance in masked diffusion models. The field of diffusion-based generative modeling has expanded into multiple directions, each addressing variance reduction from distinct angles. The taxonomy organizes work into four main branches: discrete masked diffusion for language, continuous diffusion for vision, non-diffusion training paradigms, and specialized cross-domain applications. The first branch focuses on theoretical and algorithmic improvements for discrete token spaces, where masking schedules and variance decomposition play central roles, as seen in works like Simple and effective masked[1] and Information-Theoretic Discrete Diffusion[15]. The second branch tackles continuous image generation, exploring architectural innovations and training stabilization techniques exemplified by methods such as SD-MAE[18] and various inpainting approaches[17]. The third branch examines alternative training frameworks including autoregressive comparisons[24] and variance-reduced optimization methods[6], while the fourth branch applies these principles to domain-specific problems ranging from medical imaging[25] to remote sensing[20].

Within the discrete masked diffusion landscape, a particularly active line of research investigates the theoretical underpinnings of variance in training objectives. Bringing Stability to Diffusion[0] sits squarely in this theoretical analysis cluster, focusing on variance decomposition and stabilization mechanisms for masked language modeling. This work shares conceptual ground with Information-Theoretic Discrete Diffusion[15], which also examines fundamental properties of discrete diffusion processes, though the latter emphasizes information-theoretic bounds rather than direct variance reduction.
Nearby efforts like LLaDA 1.5[3] and DiffuCoder[2] demonstrate practical applications of variance-aware training in code generation and language tasks, but they tend to prioritize empirical performance over the kind of formal variance analysis that characterizes Bringing Stability to Diffusion[0]. The central tension across these works involves balancing theoretical rigor with computational efficiency, as tighter variance bounds often require more complex training procedures.

Claimed Contributions

Systematic variance decomposition for masked diffusion model training

The authors provide the first theoretical decomposition of training variance in masked diffusion models into three distinct sources: masking pattern noise from randomness in token masking, masking rate noise from variability across different masking rates, and data noise from inherent sample difficulty. This framework explains why MDMs exhibit fundamentally higher training variance than autoregressive models.

10 retrieved papers
P-POTS: Pareto-optimal timestep sampling method

The authors introduce P-POTS, a parametric method that derives and implements the theoretically optimal masking rate sampler for minimizing training variance. It samples high-variance masking rates more frequently while using importance weighting to prevent destabilizing updates, achieving Pareto optimality among all unbiased samplers.

10 retrieved papers
Can Refute
MIRROR: Complementary masking for variance reduction

The authors propose MIRROR, a method that constructs two complementary masked samples from the same input and masking rate, exploiting their negative correlation to explicitly reduce masking pattern noise. This technique reduces variance by at least half compared to standard training.

9 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic variance decomposition for masked diffusion model training

The authors provide the first theoretical decomposition of training variance in masked diffusion models into three distinct sources: masking pattern noise from randomness in token masking, masking rate noise from variability across different masking rates, and data noise from inherent sample difficulty. This framework explains why MDMs exhibit fundamentally higher training variance than autoregressive models.
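The three-way split can be illustrated with a toy Monte Carlo check of the law of total variance. The loss model below (L = x + t + t*m) is an invented stand-in, not the paper's objective; it simply gives each claimed source {A}, {B}, {C} its own explicit term so the decomposition can be verified numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy stand-ins for the three noise sources (illustrative only):
x = rng.normal(1.0, 0.5, n)      # data noise {C}: per-example difficulty
t = rng.uniform(0.05, 0.95, n)   # masking rate noise {B}
m = rng.normal(0.0, 1.0, n)      # masking pattern noise {A}, drawn given t

# Simplified per-example loss: harder (larger) t raises both the mean
# loss and the amplitude of the pattern noise.
L = x + t + t * m

# Law of total variance, applied twice:
# Var[L] = E[Var_m[L | x,t]] + Var_t[E_m[L | x,t]] + Var_x[E_{t,m}[L | x]]
var_A = np.mean(t ** 2)   # E[Var_m[L | x,t]] = E[t^2] * Var(m)
var_B = np.var(t)         # Var_t[E_m[L | x,t]], independent of x here
var_C = np.var(x)         # Var_x[E_{t,m}[L | x]]

# The three terms recombine to the total variance (up to MC error).
assert abs((var_A + var_B + var_C) - np.var(L)) < 0.02
```

An ARM analogue of this toy model would keep only x, which is the sense in which ARMs see only source {C}.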

Contribution

P-POTS: Pareto-optimal timestep sampling method

The authors introduce P-POTS, a parametric method that derives and implements the theoretically optimal masking rate sampler for minimizing training variance. It samples high-variance masking rates more frequently while using importance weighting to prevent destabilizing updates, achieving Pareto optimality among all unbiased samplers.
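A minimal sketch of the idea behind importance-weighted t-sampling, under assumptions that are ours rather than the paper's: a hand-picked noise profile sigma(t), a discretized t grid, and the classical minimum-variance unbiased proposal q(t) proportional to p(t) * sqrt(E[L(t)^2]). The importance weight p/q also plays the role of the "appropriately smaller update step" on oversampled hard t values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noise profile: loss/gradient noise grows with masking
# rate t (assumed shape for illustration, not estimated from training).
def sigma(t):
    return 0.1 + 2.0 * t

ts = np.linspace(0.05, 0.95, 19)        # discretized masking rates
p = np.full(len(ts), 1.0 / len(ts))     # uniform reference distribution

# Classical minimum-variance unbiased proposal for estimating E_p[L(t)]:
# q*(t) ∝ p(t) * sqrt(E[L(t)^2]); here the mean per-t loss is 1.
q = p * np.sqrt(sigma(ts) ** 2 + 1.0)
q /= q.sum()

def mc_mean_and_var(probs, n=400_000):
    idx = rng.choice(len(ts), size=n, p=probs)
    w = p[idx] / probs[idx]             # w < 1 on oversampled hard t:
                                        # the "smaller update step"
    vals = w * rng.normal(1.0, sigma(ts[idx]))
    return vals.mean(), vals.var()

mean_u, var_u = mc_mean_and_var(p)      # plain uniform t-sampling
mean_q, var_q = mc_mean_and_var(q)      # variance-aware sampling
# Both estimators stay unbiased (means ~1); var_q comes out below var_u.
```

The Pareto-optimality claim in the paper concerns a stronger property over all unbiased samplers; this sketch only demonstrates the basic variance-reduction mechanism.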

Contribution

MIRROR: Complementary masking for variance reduction

The authors propose MIRROR, a method that constructs two complementary masked samples from the same input and masking rate, exploiting their negative correlation to explicitly reduce masking pattern noise. This technique reduces variance by at least half compared to standard training.
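The negative-correlation mechanism can be demonstrated with a toy antithetic-masking experiment (our construction, not the paper's training loop): at masking rate 0.5, a mask and its complement cover disjoint halves of the sequence, so their per-mask losses are negatively correlated and their average is far more stable than that of two independent masks.

```python
import numpy as np

rng = np.random.default_rng(2)
n_trials, seq_len = 100_000, 64
t = 0.5                                 # masking rate shared by the pair

# Fixed per-token "difficulty": some positions are harder to predict.
token_loss = rng.gamma(2.0, 1.0, seq_len)

def batch_loss(M):
    # Mean reconstruction loss over the masked positions of each row.
    return (M * token_loss).sum(axis=1) / M.sum(axis=1)

M1 = rng.random((n_trials, seq_len)) < t
M2 = rng.random((n_trials, seq_len)) < t

single = batch_loss(M1)                              # standard: one mask
iid_pair = 0.5 * (batch_loss(M1) + batch_loss(M2))   # two independent masks
mirror = 0.5 * (batch_loss(M1) + batch_loss(~M1))    # complementary pair

# Independent pairing roughly halves the pattern variance; the
# negatively correlated complementary pair cuts it well below half.
```

In this toy setting `np.var(mirror)` falls below half of `np.var(single)`, consistent with the "at least half" claim, because the average of a mask and its complement nearly recovers the full-sequence mean loss.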