Soft-Masked Diffusion Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: masked diffusion language models, continuous feedback, code generation
Abstract:

Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-k predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that efficiently adapts masked diffusion language models to incorporate SM. We demonstrate that training a 169M parameter model from scratch with SM yields superior perplexity and MAUVE scores compared to binary masking baselines. Similarly, a pretrained model can be enhanced with SM through continued pretraining. Finally, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces soft-masking for masked diffusion language models, blending mask embeddings with top-k predicted token embeddings to preserve partial information across decoding steps. Within the taxonomy, it occupies the 'Soft-Masking and Continuous Embedding Blending' leaf under 'Hybrid Continuous-Discrete Diffusion'. Notably, this leaf contains only the original paper itself—no sibling papers appear in the same category. This positioning suggests the work addresses a relatively sparse research direction within the broader hybrid continuous-discrete diffusion landscape, which itself contains only three distinct leaves across the entire taxonomy.

The taxonomy reveals that neighboring work explores related but distinct hybrid strategies. The sibling leaf 'Continuous Latent Augmentation for Discrete Diffusion' focuses on paired continuous latent diffusion to avoid information voids, while 'Coevolutionary Continuous-Discrete Diffusion' examines jointly evolving processes for enhanced expressivity. The parent branch 'Hybrid Continuous-Discrete Diffusion' sits alongside 'Discrete State-Space Diffusion Models' (which includes binary masking approaches) and 'Semi-Autoregressive and Block-Based Diffusion'. The taxonomy's scope and exclude notes clarify that soft-masking's continuous embedding blending distinguishes it from both pure discrete binary masking and separate latent-space augmentation methods.

Among the three contributions analyzed, the core soft-masking mechanism examined five candidates and found two potentially refutable prior works, suggesting some overlap with existing hybrid approaches. The training methodology and empirical demonstration contributions each examined ten candidates with no clear refutations identified. The limited search scope—25 total candidates examined across all contributions—means these statistics reflect top-K semantic matches and citation expansion rather than exhaustive coverage. The soft-masking mechanism appears to have the most substantial prior work among the three contributions, while the training methodology and scaling demonstrations show less direct overlap within the examined candidate set.

Based on the limited search scope of 25 candidates, the work appears to occupy a relatively unexplored niche within hybrid continuous-discrete diffusion, though the core soft-masking idea shows some overlap with existing hybrid augmentation strategies. The taxonomy structure indicates this is a sparse research direction with few directly comparable papers, but the analysis does not cover the full breadth of diffusion language modeling literature beyond top-K semantic matches.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 2

Research Landscape Overview

Core task: Improving masked diffusion language models through continuous feedback mechanisms.

The field of diffusion-based language modeling has evolved into several distinct branches that address complementary challenges. At the highest level, work divides into Diffusion Process Design and State-Space Formulations, which explores how to bridge discrete text and continuous diffusion dynamics; Guidance and Control Mechanisms, which develops methods to steer generation toward desired attributes; Pre-training and Transfer Learning, which investigates how to leverage large-scale data and adapt models efficiently; Domain-Specific Applications, which tailors diffusion approaches to specialized tasks; and Surveys and Comparative Studies, which synthesize emerging trends.

Within process design, a particularly active line examines hybrid continuous-discrete formulations, such as SSD-LM[1] and Score-based Continuous-time[9], that blend token-level masking with smooth latent trajectories, while guidance research ranges from derivative-free methods like Derivative-Free Guidance[6] to classifier-based steering as in Classifiers Guided Controllable[18]. A central tension across these branches concerns how to maintain both generation quality and controllability without sacrificing efficiency. Hybrid approaches like Continuously Augmented Discrete[3] and Coevolutionary Continuous Discrete[4] explore co-evolving discrete and continuous representations, whereas iterative refinement methods such as Review Remask Refine[8] focus on feedback loops that progressively improve outputs.

Soft-Masked Diffusion[0] sits squarely within the hybrid continuous-discrete cluster, emphasizing soft-masking and continuous embedding blending to enable smoother feedback integration compared to hard token replacements. This contrasts with purely discrete masking schemes and aligns closely with works like Continuously Augmented Discrete[3], which similarly augment discrete states with continuous signals, though Soft-Masked Diffusion[0] places greater emphasis on end-to-end differentiable feedback. The interplay between process design and guidance remains an open frontier, as researchers seek unified frameworks that combine expressive state spaces with flexible control.

Claimed Contributions

Soft-masking mechanism for masked diffusion language models

The authors propose soft-masking, a mechanism that enriches masked tokens by creating a convex combination of the mask token and top-k predicted tokens weighted by confidence scores. This allows partial information about masked tokens to propagate across decoding steps rather than being discarded through binary masking decisions.

5 retrieved papers
Can Refute
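The convex-combination idea described above can be sketched in a few lines of NumPy. This is a minimal illustration only: the function name, the softmax renormalization over the top-k logits, and the choice of the top-1 probability as the overall blend weight are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def soft_mask_embedding(mask_emb, token_embs, logits, k=4):
    """Blend the mask embedding with the top-k predicted token embeddings.

    Returns a convex combination of the mask token's embedding and a
    confidence-weighted average of the top-k predicted tokens' embeddings.
    (Sketch only; the exact weighting scheme is an assumption.)
    """
    topk = np.argsort(logits)[-k:]                 # indices of the k highest logits
    w = np.exp(logits[topk] - logits[topk].max())  # stable softmax over top-k
    w /= w.sum()
    conf = w.max()                                 # assumed blend weight: top-1 confidence
    predicted = (w[:, None] * token_embs[topk]).sum(axis=0)
    return (1 - conf) * mask_emb + conf * predicted
```

When the model is uncertain (low `conf`), the result stays close to the plain mask embedding; when it is confident, the embedding moves toward the predicted tokens, letting partial information survive into the next decoding step.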
Training methodology for soft-masking integration

The authors introduce a two-pass training procedure (Algorithm 1) that enables MDLMs to learn soft-masking parameters concurrently with backbone parameters. This parallelizable method approximates the feedback-augmented marginal distribution and allows efficient adaptation of existing models.

10 retrieved papers
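The two-pass structure described above can be sketched as follows. Since Algorithm 1 itself is not reproduced in this report, every detail here is an assumption: a first forward pass on hard-masked inputs produces predictions (treated as stop-gradient in practice), those predictions build soft-masked embeddings, and a second forward pass on the blended inputs supplies the training signal for both soft-masking and backbone parameters.

```python
import numpy as np

def two_pass_step(forward, embed_table, token_ids, masked_pos, mask_id, top_k=4):
    """Hedged sketch of a two-pass soft-masking training step.

    Pass 1: run the model on hard-masked inputs to obtain per-position logits.
    Pass 2: replace each retained mask embedding with a confidence-weighted
    blend of the mask and top-k predicted token embeddings, then run the model
    again; the loss would be computed on the pass-2 logits.
    """
    embs = embed_table[token_ids].copy()
    embs[masked_pos] = embed_table[mask_id]   # hard masks for pass 1
    logits = forward(embs)                    # pass 1 (stop-gradient in practice)
    for p in masked_pos:
        topk = np.argsort(logits[p])[-top_k:]
        w = np.exp(logits[p][topk] - logits[p][topk].max())
        w /= w.sum()
        conf = w.max()                        # assumed blend weight
        predicted = (w[:, None] * embed_table[topk]).sum(axis=0)
        embs[p] = (1 - conf) * embed_table[mask_id] + conf * predicted
    return forward(embs)                      # pass 2: logits used for the loss
```

Because pass 1 only has to produce targets for building the soft masks, the two passes can run over the same batch back-to-back, which is presumably what makes the procedure parallelizable and cheap enough for continued pretraining of existing models.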
Demonstration of soft-masking benefits across model scales and tasks

The authors show that soft-masking improves performance when training small models from scratch on language modeling tasks and when finetuning large-scale models (Dream-7B and Dream-Coder-7B) on coding benchmarks, particularly in high-throughput settings with limited decoding iterations.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Soft-masking mechanism for masked diffusion language models

The authors propose soft-masking, a mechanism that enriches masked tokens by creating a convex combination of the mask token and top-k predicted tokens weighted by confidence scores. This allows partial information about masked tokens to propagate across decoding steps rather than being discarded through binary masking decisions.

Contribution

Training methodology for soft-masking integration

The authors introduce a two-pass training procedure (Algorithm 1) that enables MDLMs to learn soft-masking parameters concurrently with backbone parameters. This parallelizable method approximates the feedback-augmented marginal distribution and allows efficient adaptation of existing models.

Contribution

Demonstration of soft-masking benefits across model scales and tasks

The authors show that soft-masking improves performance when training small models from scratch on language modeling tasks and when finetuning large-scale models (Dream-7B and Dream-Coder-7B) on coding benchmarks, particularly in high-throughput settings with limited decoding iterations.