Soft-Masked Diffusion Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: masked diffusion language models, continuous feedback, code generation
Abstract:

Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-k predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that efficiently adapts masked diffusion language models to incorporate SM. We demonstrate that training a 169M parameter model from scratch with SM yields superior perplexity and MAUVE scores compared to binary masking baselines. Similarly, a pretrained model can be enhanced with SM through continued pretraining. Finally, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces soft-masking for masked diffusion language models, blending mask embeddings with top-k predicted token embeddings to preserve partial information across decoding steps. Within the taxonomy, it occupies the 'Soft-Masking and Continuous Embedding Blending' leaf under 'Hybrid Continuous-Discrete Diffusion'. Notably, this leaf contains only the original paper itself—no sibling papers appear in the same category. This positioning suggests the work addresses a relatively sparse research direction within the broader hybrid continuous-discrete diffusion landscape, which itself contains only three distinct leaves across the entire taxonomy.

The taxonomy reveals that neighboring work explores related but distinct hybrid strategies. The sibling leaf 'Continuous Latent Augmentation for Discrete Diffusion' focuses on paired continuous latent diffusion to avoid information voids, while 'Coevolutionary Continuous-Discrete Diffusion' examines jointly evolving processes for enhanced expressivity. The parent branch 'Hybrid Continuous-Discrete Diffusion' sits alongside 'Discrete State-Space Diffusion Models' (which includes binary masking approaches) and 'Semi-Autoregressive and Block-Based Diffusion'. The taxonomy's scope and exclude notes clarify that soft-masking's continuous embedding blending distinguishes it from both pure discrete binary masking and separate latent-space augmentation methods.

Among the three contributions analyzed, the core soft-masking mechanism examined five candidates and found two potentially refutable prior works, suggesting some overlap with existing hybrid approaches. The training methodology and empirical demonstration contributions each examined ten candidates with no clear refutations identified. The limited search scope—25 total candidates examined across all contributions—means these statistics reflect top-K semantic matches and citation expansion rather than exhaustive coverage. The soft-masking mechanism appears to have the most substantial prior work among the three contributions, while the training methodology and scaling demonstrations show less direct overlap within the examined candidate set.

Based on the limited search scope of 25 candidates, the work appears to occupy a relatively unexplored niche within hybrid continuous-discrete diffusion, though the core soft-masking idea shows some overlap with existing hybrid augmentation strategies. The taxonomy structure indicates this is a sparse research direction with few directly comparable papers, but the analysis does not cover the full breadth of diffusion language modeling literature beyond top-K semantic matches.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 2

Research Landscape Overview

Core task: Improving masked diffusion language models through continuous feedback mechanisms.

The field of diffusion-based language modeling has evolved into several distinct branches that address complementary challenges. At the highest level, work divides into Diffusion Process Design and State-Space Formulations, which explores how to bridge discrete text and continuous diffusion dynamics; Guidance and Control Mechanisms, which develops methods to steer generation toward desired attributes; Pre-training and Transfer Learning, which investigates how to leverage large-scale data and adapt models efficiently; Domain-Specific Applications, which tailors diffusion approaches to specialized tasks; and Surveys and Comparative Studies, which synthesize emerging trends.

Within process design, a particularly active line examines hybrid continuous-discrete formulations, such as SSD-LM[1] and Score-based Continuous-time[9], that blend token-level masking with smooth latent trajectories, while guidance research ranges from derivative-free methods like Derivative-Free Guidance[6] to classifier-based steering as in Classifiers Guided Controllable[18]. A central tension across these branches concerns how to maintain both generation quality and controllability without sacrificing efficiency. Hybrid approaches like Continuously Augmented Discrete[3] and Coevolutionary Continuous Discrete[4] explore co-evolving discrete and continuous representations, whereas iterative refinement methods such as Review Remask Refine[8] focus on feedback loops that progressively improve outputs.

Soft-Masked Diffusion[0] sits squarely within the hybrid continuous-discrete cluster, emphasizing soft-masking and continuous embedding blending to enable smoother feedback integration compared to hard token replacements. This contrasts with purely discrete masking schemes and aligns closely with works like Continuously Augmented Discrete[3], which similarly augment discrete states with continuous signals, though Soft-Masked Diffusion[0] places greater emphasis on end-to-end differentiable feedback. The interplay between process design and guidance remains an open frontier, as researchers seek unified frameworks that combine expressive state spaces with flexible control.

Claimed Contributions

Soft-masking mechanism for masked diffusion language models

The authors propose soft-masking, a mechanism that enriches masked tokens by creating a convex combination of the mask token and top-k predicted tokens weighted by confidence scores. This allows partial information about masked tokens to propagate across decoding steps rather than being discarded through binary masking decisions.

5 retrieved papers
Can Refute
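The convex-combination idea described above can be sketched in a few lines of NumPy. This is a minimal illustration only: the function name, the softmax renormalization over the top-k logits, and the choice of the top-1 probability as the overall blend weight are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def soft_mask_embedding(mask_emb, token_embs, logits, k=4):
    """Blend the mask embedding with the top-k predicted token embeddings.

    Returns a convex combination of the mask token's embedding and a
    confidence-weighted average of the top-k predicted tokens' embeddings.
    (Sketch only; the exact weighting scheme is an assumption.)
    """
    topk = np.argsort(logits)[-k:]                 # indices of the k highest logits
    w = np.exp(logits[topk] - logits[topk].max())  # stable softmax over top-k
    w /= w.sum()
    conf = w.max()                                 # assumed blend weight: top-1 confidence
    predicted = (w[:, None] * token_embs[topk]).sum(axis=0)
    return (1 - conf) * mask_emb + conf * predicted
```

When the model is uncertain (low `conf`), the result stays close to the plain mask embedding; when it is confident, the embedding moves toward the predicted tokens, letting partial information survive into the next decoding step.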
Training methodology for soft-masking integration

The authors introduce a two-pass training procedure (Algorithm 1) that enables MDLMs to learn soft-masking parameters concurrently with backbone parameters. This parallelizable method approximates the feedback-augmented marginal distribution and allows efficient adaptation of existing models.

10 retrieved papers
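The two-pass structure described above can be sketched as follows. Since Algorithm 1 itself is not reproduced in this report, every detail here is an assumption: a first forward pass on hard-masked inputs produces predictions (treated as stop-gradient in practice), those predictions build soft-masked embeddings, and a second forward pass on the blended inputs supplies the training signal for both soft-masking and backbone parameters.

```python
import numpy as np

def two_pass_step(forward, embed_table, token_ids, masked_pos, mask_id, top_k=4):
    """Hedged sketch of a two-pass soft-masking training step.

    Pass 1: run the model on hard-masked inputs to obtain per-position logits.
    Pass 2: replace each retained mask embedding with a confidence-weighted
    blend of the mask and top-k predicted token embeddings, then run the model
    again; the loss would be computed on the pass-2 logits.
    """
    embs = embed_table[token_ids].copy()
    embs[masked_pos] = embed_table[mask_id]   # hard masks for pass 1
    logits = forward(embs)                    # pass 1 (stop-gradient in practice)
    for p in masked_pos:
        topk = np.argsort(logits[p])[-top_k:]
        w = np.exp(logits[p][topk] - logits[p][topk].max())
        w /= w.sum()
        conf = w.max()                        # assumed blend weight
        predicted = (w[:, None] * embed_table[topk]).sum(axis=0)
        embs[p] = (1 - conf) * embed_table[mask_id] + conf * predicted
    return forward(embs)                      # pass 2: logits used for the loss
```

Because pass 1 only has to produce targets for building the soft masks, the two passes can run over the same batch back-to-back, which is presumably what makes the procedure parallelizable and cheap enough for continued pretraining of existing models.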
Demonstration of soft-masking benefits across model scales and tasks

The authors show that soft-masking improves performance when training small models from scratch on language modeling tasks and when finetuning large-scale models (Dream-7B and Dream-Coder-7B) on coding benchmarks, particularly in high-throughput settings with limited decoding iterations.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Soft-masking mechanism for masked diffusion language models

The authors propose soft-masking, a mechanism that enriches masked tokens by creating a convex combination of the mask token and top-k predicted tokens weighted by confidence scores. This allows partial information about masked tokens to propagate across decoding steps rather than being discarded through binary masking decisions.

Contribution

Training methodology for soft-masking integration

The authors introduce a two-pass training procedure (Algorithm 1) that enables MDLMs to learn soft-masking parameters concurrently with backbone parameters. This parallelizable method approximates the feedback-augmented marginal distribution and allows efficient adaptation of existing models.

Contribution

Demonstration of soft-masking benefits across model scales and tasks

The authors show that soft-masking improves performance when training small models from scratch on language modeling tasks and when finetuning large-scale models (Dream-7B and Dream-Coder-7B) on coding benchmarks, particularly in high-throughput settings with limited decoding iterations.