Self-Speculative Masked Diffusions

ICLR 2026 Conference Submission
Anonymous Authors
masked diffusion, generative models, speculative decoding, speculative sampling, LLM
Abstract:

We present self-speculative masked diffusions, a new class of masked diffusion generative models for discrete data that require significantly fewer function evaluations to generate samples. Standard masked diffusion models predict factorized logits over the currently masked positions. A number of masked positions are then sampled; however, because of the factorization approximation, sampling too many positions at once leads to poor sample quality. As a result, many simulation steps, and therefore many neural network function evaluations, are required to generate high-quality data. We reduce this computational burden by generating \emph{non-factorized} predictions over masked positions. This is achieved by switching the final transformer attention mask from non-causal to causal, enabling draft token generation and parallel validation via a novel, model-integrated speculative sampling mechanism. The result is a non-factorized predictive distribution over masked positions in a single forward pass. We apply our method to GPT-2 scale text modelling and protein sequence generation, finding that we achieve a ~2x reduction in the required number of network forward passes relative to standard masked diffusion models.
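The draft-and-verify step described in the abstract builds on the standard speculative sampling acceptance rule: drafted tokens are accepted with probability min(1, p/q), and the first rejection is resampled from the residual distribution. A minimal sketch of that general rule follows; it illustrates the generic mechanism, not the paper's model-integrated variant, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(draft_probs, target_probs, draft_tokens):
    """Generic speculative-sampling verification step.

    draft_probs, target_probs: (k, vocab) arrays giving the draft and
    target distributions at each of the k drafted positions.
    draft_tokens: the k tokens sampled from the draft distributions.
    Returns (number of accepted tokens, replacement token or None).
    """
    k = len(draft_tokens)
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i, tok]
        q = draft_probs[i, tok]
        # Accept with probability min(1, p/q); accepted tokens are then
        # exact samples from the target distribution.
        if rng.random() < min(1.0, p / q):
            continue
        # On rejection, resample from the residual max(0, p - q), renormalized.
        residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
        residual /= residual.sum()
        return i, rng.choice(len(residual), p=residual)
    return k, None
```

When draft and target agree exactly, every token is accepted, which is why a good draft distribution lets many positions be committed per forward pass.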

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces self-speculative masked diffusion models that reduce function evaluations during discrete data generation by producing non-factorized predictions over masked positions. It resides in the 'Adaptive Unmasking and Scheduling' leaf of the taxonomy, which contains five papers in total and focuses on dynamically determining which tokens to unmask based on confidence or learned policies. This level of activity suggests an established but not overcrowded research direction within the broader inference acceleration landscape.

The taxonomy reveals that adaptive unmasking sits alongside two related inference acceleration strategies: 'Parallel Token Generation and Conditional Independence' (three papers) and 'Iterative Refinement and Remasking' (two papers). The scope notes clarify that adaptive unmasking excludes fixed scheduling and remasking approaches, positioning this work as distinct from iterative correction methods. Neighboring branches address architectural modifications and training objectives, suggesting the field separates inference-time optimizations from model design improvements. The paper's hybrid causal-noncausal architecture bridges these categories, touching both inference strategy and architectural innovation.

Among the twenty-five candidates examined across the three contributions, none were identified as clearly refuting the proposed methods: ten candidates for the core self-speculative mechanism, five for the hybrid architecture, and ten for the theoretical characterization, with zero refutations in each case. Within this top-25 set of semantically similar papers, no direct prior work on model-integrated speculative sampling for masked diffusion was found. The absence of refutations across all contributions indicates potential novelty, though the limited search scale leaves open the possibility of relevant work outside this candidate set.

Based on the limited literature search, the work appears to occupy a relatively unexplored intersection between speculative decoding and masked diffusion. The taxonomy shows established work on adaptive scheduling and parallel generation, but the specific mechanism of causal attention switching for draft-and-verify within masked diffusion seems distinct from the examined candidates. The analysis covers only the top-ranked semantic matches and does not claim exhaustive coverage of all related inference acceleration techniques.

Taxonomy

Core-task Taxonomy Papers: 37
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: Accelerating masked diffusion models for discrete data generation. The field organizes around four main branches that address complementary aspects of making masked diffusion practical and effective. Inference Acceleration via Sampling Strategy Optimization focuses on reducing the number of denoising steps required at generation time through smarter unmasking schedules and adaptive policies, with works like Dilated Scheduling[19] and Learning Unmasking Policies[15] exploring how to strategically reveal tokens. Model Architecture and Training Improvements targets the underlying network design and learning objectives to boost efficiency from the ground up, exemplified by approaches such as Unified Discrete Diffusion[10] and Scaling Masked Text[4]. Theoretical Foundations and Analysis provides rigorous understanding of convergence properties and error bounds, while Application Domains and Task-Specific Adaptations demonstrates how these models extend to diverse settings including protein design, music generation, and code synthesis.

Within the sampling acceleration branch, recent efforts have explored various trade-offs between generation quality and computational cost. Some methods like Remasking Inference Scaling[2] and Lookahead Unmasking[23] introduce iterative refinement or lookahead mechanisms to improve sample fidelity without proportionally increasing steps, while others such as Star-Shaped Masked[29] and KLASS[33] propose alternative masking geometries or knowledge-guided strategies. Self-Speculative Masked Diffusions[0] sits naturally among these adaptive unmasking approaches, emphasizing speculative token prediction to accelerate inference.
Compared to neighboring works like Dilated Scheduling[19], which focuses on deterministic schedule design, or Learning Unmasking Policies[15], which learns data-driven policies, Self-Speculative Masked Diffusions[0] leverages the model's own predictions to guide dynamic unmasking decisions, offering a complementary perspective on how to balance speed and generation quality in the discrete diffusion setting.

Claimed Contributions

Self-speculative masked diffusion generative models

The authors introduce a new class of masked diffusion models that generate non-factorized predictions over masked positions, reducing the number of neural network forward passes needed for high-quality sample generation by approximately 2× compared to standard masked diffusion models.

10 retrieved papers

Hybrid non-causal and causal transformer architecture

The authors propose a novel hybrid transformer architecture combining non-causal blocks for draft generation with causal blocks for verification, enabling efficient speculative sampling within a single model through a permutation-informed design that ensures the causal target distribution strictly improves over the non-causal draft distribution.

5 retrieved papers

Theoretical characterization of self-speculative masked diffusion sampling

The authors provide a theoretical analysis of their sampling procedure, deriving a tractable recursive decomposition for computing the distribution of generated samples and establishing an evidence lower bound on the model log-likelihood despite the shifting target distribution during generation.

10 retrieved papers
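The hybrid architecture claimed above combines two attention patterns: bidirectional attention in the earlier blocks (as in standard masked diffusion transformers) and causal attention over a permutation of the masked positions in the final block, which is what yields a non-factorized joint prediction. A minimal sketch of these two patterns, assuming boolean attention masks; the function name and the exact masking rules are illustrative assumptions, not the paper's verified design.

```python
import numpy as np

def hybrid_attention_masks(seq_len, masked_positions, order):
    """Build the two attention patterns contrasted in the hybrid design.

    Returns boolean (seq_len, seq_len) matrices where entry [i, j] == True
    means query position i may attend to key position j.
    """
    # Non-causal blocks: full bidirectional attention over the sequence.
    non_causal = np.ones((seq_len, seq_len), dtype=bool)

    # Final causal block: each masked position attends to all unmasked
    # positions plus masked positions that come no later than itself under
    # the chosen permutation `order`, giving a chain-rule (non-factorized)
    # joint distribution over the masked positions.
    causal = np.ones((seq_len, seq_len), dtype=bool)
    rank = {pos: r for r, pos in enumerate(order)}
    for i in masked_positions:
        for j in masked_positions:
            causal[i, j] = rank[j] <= rank[i]
    return non_causal, causal
```

For example, with positions 1 and 3 masked and generation order [3, 1], position 1 may attend to position 3 in the causal block, but not vice versa.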

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Self-speculative masked diffusion generative models

The authors introduce a new class of masked diffusion models that generate non-factorized predictions over masked positions, reducing the number of neural network forward passes needed for high-quality sample generation by approximately 2× compared to standard masked diffusion models.

Contribution

Hybrid non-causal and causal transformer architecture

The authors propose a novel hybrid transformer architecture combining non-causal blocks for draft generation with causal blocks for verification, enabling efficient speculative sampling within a single model through a permutation-informed design that ensures the causal target distribution strictly improves over the non-causal draft distribution.

Contribution

Theoretical characterization of self-speculative masked diffusion sampling

The authors provide a theoretical analysis of their sampling procedure, deriving a tractable recursive decomposition for computing the distribution of generated samples and establishing an evidence lower bound on the model log-likelihood despite the shifting target distribution during generation.
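The distinction underlying these contributions is that a causal head defines the joint over masked positions via the chain rule, p(x_1, ..., x_k) = prod_i p(x_i | x_{<i}), whereas a factorized head multiplies independent marginals and cannot capture correlations. A toy two-token example makes the gap concrete; the numbers are purely illustrative and not taken from the paper.

```python
# Toy target over two binary tokens: perfectly correlated, p(00) = p(11) = 0.5.
joint = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}

# Factorized prediction: product of marginals. Both marginals are uniform,
# so every pair gets probability 0.25, including impossible ones.
marg0 = {t: sum(p for (a, b), p in joint.items() if a == t) for t in (0, 1)}
marg1 = {t: sum(p for (a, b), p in joint.items() if b == t) for t in (0, 1)}
factorized = {(a, b): marg0[a] * marg1[b] for a in (0, 1) for b in (0, 1)}

# Chain-rule (causal) prediction: p(a) * p(b | a) recovers the joint exactly.
chain = {(a, b): marg0[a] * (joint[(a, b)] / marg0[a])
         for a in (0, 1) for b in (0, 1)}

assert factorized[(0, 1)] == 0.25  # factorized sampling can emit impossible pairs
assert chain == joint              # the causal decomposition matches the target
```

This is why unmasking many positions at once under a factorized model degrades sample quality, and why a single causal pass over the masked positions avoids the approximation.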