Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models

ICLR 2026 Conference Submission
Anonymous Authors
Masked Diffusion Models, Training Variance, Training Stability, Mask Schedule, Mask Sampling
Abstract:

Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from inherently higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so MDMs and ARMs that are equally strong at initialization often diverge after task-specific training, with MDMs falling far behind. To date there has been no theoretical explanation for this gap, nor a systematic solution. In this paper, we derive the first decomposition of MDM training variance into three sources -- {A} masking pattern noise, {B} masking rate noise, and {C} data noise -- while ARMs are affected only by {C}. This cleanly explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a Pareto-optimal t-sampler that minimizes training variance by sampling harder t values more often while taking appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated sample pairs to reduce {A}. Experiments show that, compared to standard MDM training, our methods improve accuracy by 7–8% on complex reasoning tasks while reducing run-to-run variability to near ARM levels, substantially narrowing the gap with strong ARM baselines; in most settings, even the best runs of baseline methods fall below the worst run of our method.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 3

Research Landscape Overview

Core task: Reducing training variance in masked diffusion models. The field of diffusion-based generative modeling has expanded into multiple directions, each addressing variance reduction from distinct angles. The taxonomy organizes work into four main branches: discrete masked diffusion for language, continuous diffusion for vision, non-diffusion training paradigms, and specialized cross-domain applications. The first branch focuses on theoretical and algorithmic improvements for discrete token spaces, where masking schedules and variance decomposition play central roles, as seen in works like Simple and effective masked[1] and Information-Theoretic Discrete Diffusion[15]. The second branch tackles continuous image generation, exploring architectural innovations and training stabilization techniques exemplified by methods such as SD-MAE[18] and various inpainting approaches[17]. The third branch examines alternative training frameworks including autoregressive comparisons[24] and variance-reduced optimization methods[6], while the fourth branch applies these principles to domain-specific problems ranging from medical imaging[25] to remote sensing[20].

Within the discrete masked diffusion landscape, a particularly active line of research investigates the theoretical underpinnings of variance in training objectives. Bringing Stability to Diffusion[0] sits squarely in this theoretical analysis cluster, focusing on variance decomposition and stabilization mechanisms for masked language modeling. This work shares conceptual ground with Information-Theoretic Discrete Diffusion[15], which also examines fundamental properties of discrete diffusion processes, though the latter emphasizes information-theoretic bounds rather than direct variance reduction.
Nearby efforts like LLaDA 1.5[3] and DiffuCoder[2] demonstrate practical applications of variance-aware training in code generation and language tasks, but they tend to prioritize empirical performance over the kind of formal variance analysis that characterizes Bringing Stability to Diffusion[0]. The central tension across these works involves balancing theoretical rigor with computational efficiency, as tighter variance bounds often require more complex training procedures.

Claimed Contributions

Systematic variance decomposition for masked diffusion model training

The authors provide the first theoretical decomposition of training variance in masked diffusion models into three distinct sources: masking pattern noise from randomness in token masking, masking rate noise from variability across different masking rates, and data noise from inherent sample difficulty. This framework explains why MDMs exhibit fundamentally higher training variance than autoregressive models.

10 retrieved papers
P-POTS: Pareto-optimal timestep sampling method

The authors introduce P-POTS, a parametric method that derives and implements the theoretically optimal masking rate sampler for minimizing training variance. It samples high-variance masking rates more frequently while using importance weighting to prevent destabilizing updates, achieving Pareto optimality among all unbiased samplers.

10 retrieved papers
Can Refute
MIRROR: Complementary masking for variance reduction

The authors propose MIRROR, a method that constructs two complementary masked samples from the same input and masking rate, exploiting their negative correlation to explicitly reduce masking pattern noise. This technique reduces variance by at least half compared to standard training.

9 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic variance decomposition for masked diffusion model training

The authors provide the first theoretical decomposition of training variance in masked diffusion models into three distinct sources: masking pattern noise from randomness in token masking, masking rate noise from variability across different masking rates, and data noise from inherent sample difficulty. This framework explains why MDMs exhibit fundamentally higher training variance than autoregressive models.
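The three-way split can be illustrated with a toy Monte Carlo check of the law of total variance. The loss model below (L = x + t + t*m) is an invented stand-in, not the paper's objective; it simply gives each claimed source {A}, {B}, {C} its own explicit term so the decomposition can be verified numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy stand-ins for the three noise sources (illustrative only):
x = rng.normal(1.0, 0.5, n)      # data noise {C}: per-example difficulty
t = rng.uniform(0.05, 0.95, n)   # masking rate noise {B}
m = rng.normal(0.0, 1.0, n)      # masking pattern noise {A}, drawn given t

# Simplified per-example loss: harder (larger) t raises both the mean
# loss and the amplitude of the pattern noise.
L = x + t + t * m

# Law of total variance, applied twice:
# Var[L] = E[Var_m[L | x,t]] + Var_t[E_m[L | x,t]] + Var_x[E_{t,m}[L | x]]
var_A = np.mean(t ** 2)   # E[Var_m[L | x,t]] = E[t^2] * Var(m)
var_B = np.var(t)         # Var_t[E_m[L | x,t]], independent of x here
var_C = np.var(x)         # Var_x[E_{t,m}[L | x]]

# The three terms recombine to the total variance (up to MC error).
assert abs((var_A + var_B + var_C) - np.var(L)) < 0.02
```

An ARM analogue of this toy model would keep only x, which is the sense in which ARMs see only source {C}.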

Contribution

P-POTS: Pareto-optimal timestep sampling method

The authors introduce P-POTS, a parametric method that derives and implements the theoretically optimal masking rate sampler for minimizing training variance. It samples high-variance masking rates more frequently while using importance weighting to prevent destabilizing updates, achieving Pareto optimality among all unbiased samplers.
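A minimal sketch of the idea behind importance-weighted t-sampling, under assumptions that are ours rather than the paper's: a hand-picked noise profile sigma(t), a discretized t grid, and the classical minimum-variance unbiased proposal q(t) proportional to p(t) * sqrt(E[L(t)^2]). The importance weight p/q also plays the role of the "appropriately smaller update step" on oversampled hard t values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noise profile: loss/gradient noise grows with masking
# rate t (assumed shape for illustration, not estimated from training).
def sigma(t):
    return 0.1 + 2.0 * t

ts = np.linspace(0.05, 0.95, 19)        # discretized masking rates
p = np.full(len(ts), 1.0 / len(ts))     # uniform reference distribution

# Classical minimum-variance unbiased proposal for estimating E_p[L(t)]:
# q*(t) ∝ p(t) * sqrt(E[L(t)^2]); here the mean per-t loss is 1.
q = p * np.sqrt(sigma(ts) ** 2 + 1.0)
q /= q.sum()

def mc_mean_and_var(probs, n=400_000):
    idx = rng.choice(len(ts), size=n, p=probs)
    w = p[idx] / probs[idx]             # w < 1 on oversampled hard t:
                                        # the "smaller update step"
    vals = w * rng.normal(1.0, sigma(ts[idx]))
    return vals.mean(), vals.var()

mean_u, var_u = mc_mean_and_var(p)      # plain uniform t-sampling
mean_q, var_q = mc_mean_and_var(q)      # variance-aware sampling
# Both estimators stay unbiased (means ~1); var_q comes out below var_u.
```

The Pareto-optimality claim in the paper concerns a stronger property over all unbiased samplers; this sketch only demonstrates the basic variance-reduction mechanism.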

Contribution

MIRROR: Complementary masking for variance reduction

The authors propose MIRROR, a method that constructs two complementary masked samples from the same input and masking rate, exploiting their negative correlation to explicitly reduce masking pattern noise. This technique reduces variance by at least half compared to standard training.
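The negative-correlation mechanism can be demonstrated with a toy antithetic-masking experiment (our construction, not the paper's training loop): at masking rate 0.5, a mask and its complement cover disjoint halves of the sequence, so their per-mask losses are negatively correlated and their average is far more stable than that of two independent masks.

```python
import numpy as np

rng = np.random.default_rng(2)
n_trials, seq_len = 100_000, 64
t = 0.5                                 # masking rate shared by the pair

# Fixed per-token "difficulty": some positions are harder to predict.
token_loss = rng.gamma(2.0, 1.0, seq_len)

def batch_loss(M):
    # Mean reconstruction loss over the masked positions of each row.
    return (M * token_loss).sum(axis=1) / M.sum(axis=1)

M1 = rng.random((n_trials, seq_len)) < t
M2 = rng.random((n_trials, seq_len)) < t

single = batch_loss(M1)                              # standard: one mask
iid_pair = 0.5 * (batch_loss(M1) + batch_loss(M2))   # two independent masks
mirror = 0.5 * (batch_loss(M1) + batch_loss(~M1))    # complementary pair

# Independent pairing roughly halves the pattern variance; the
# negatively correlated complementary pair cuts it well below half.
```

In this toy setting `np.var(mirror)` falls below half of `np.var(single)`, consistent with the "at least half" claim, because the average of a mask and its complement nearly recovers the full-sequence mean loss.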