Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: masked diffusion models, language models, inference
Abstract:

Masked diffusion models (MDMs) offer a compelling alternative to autoregressive models (ARMs) for discrete text generation because they enable parallel token sampling rather than sequential, left-to-right generation, promising much faster inference. However, effective parallel sampling faces two competing requirements: (i) simultaneously updated tokens must be conditionally independent, and (ii) updates should prioritise high-confidence predictions. These goals conflict because high-confidence predictions often cluster and depend on each other, limiting opportunities for parallel updates.

We present PUNT, a model-agnostic sampler that reconciles this trade-off. Our method identifies token dependencies and removes lower-confidence tokens from conflicting groups. This produces sets of indices for unmasking that satisfy both independence and confidence criteria. Our approach ensures improved parallel unmasking through approximate conditional independence testing.

Our experiments show that PUNT delivers a superior trade-off between accuracy and compute compared to other strong training-free baselines, especially for generation of longer sequences. On the IFEval benchmark, it achieves up to 16% higher accuracy than baseline methods, including sequential (one-by-one) generation. These gains hold across a range of hyperparameter values, reducing the need for brittle hyperparameter tuning. Moreover, we observe that PUNT induces an emergent hierarchical generation strategy, in which the model first establishes high-level paragraph structure before local refinement, suggesting a planning-like generation process that contributes to strong alignment performance.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PUNT, a model-agnostic sampler that addresses token dependency conflicts during parallel unmasking in masked diffusion models. It resides in the 'Inference-Time Sampling Policies' leaf of the taxonomy, which contains only three papers total. This leaf sits within the broader 'Sampling Strategies and Scheduling' branch, indicating a relatively sparse research direction focused on inference-only methods that do not require training modifications. The small sibling count suggests this specific problem space—balancing conditional independence and confidence during parallel sampling—has received limited prior attention compared to other branches like application-specific architectures or core model formulations.

The taxonomy reveals neighboring work in 'Training-Aware Sampling Integration' (two papers on learned unmasking policies) and 'Speculative and Multi-Token Decoding' (two papers on multi-token prediction). PUNT diverges from training-aware methods by operating purely at inference time, avoiding the need for path-aligned training or learned policies. It also differs from speculative decoding approaches, which typically predict and validate tokens in a draft-verify framework, whereas PUNT explicitly tests for contextual independence to construct safe parallel unmasking sets. The taxonomy's scope notes clarify that PUNT's inference-only nature excludes it from training-integrated methods, while its focus on dependency resolution distinguishes it from single-token scheduling heuristics.

Among the fifteen candidates examined, none clearly refute any of PUNT's three contributions. The first contribution (contextual independence testing for parallel unmasking) examined five candidates with zero refutations; the second (recursive binary encoding algorithm) examined six with zero refutations; the third (contextual independence criterion) examined four with zero refutations. This limited search scope—fifteen papers from semantic retrieval—suggests that within the examined neighborhood, no prior work explicitly combines dependency testing with confidence-based parallel unmasking in this manner. However, the small candidate pool means the analysis cannot rule out relevant work outside the top-K semantic matches or in adjacent research communities.

Given the sparse taxonomy leaf and absence of refutations among fifteen examined candidates, PUNT appears to occupy a relatively unexplored niche within inference-time sampling policies. The analysis is constrained by the limited search scope and does not cover exhaustive citation networks or domain-specific venues. The novelty assessment reflects what is visible within top-K semantic neighbors, not a comprehensive field survey.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 0

Research Landscape Overview

Core task: parallel sampling from masked diffusion models. The field centers on generating discrete or continuous data by iteratively unmasking tokens or patches, enabling faster inference than traditional autoregressive methods.

The taxonomy reveals four main branches: Sampling Strategies and Scheduling explores how to choose which tokens to unmask at each step, ranging from fixed cosine schedules (as in MaskGIT[12]) to learned policies (Learning Unmasking Policies[6]) and dilated or optimal schedules (Dilated Scheduling[5], Optimal Inference Schedules[8]); Model Architectures and Formulations examines the underlying probabilistic frameworks, including simplified formulations (Simplified Masked Diffusion[1]), variational objectives (Variational Masked Diffusion[15]), and self-speculative decoding variants (Self-speculative Masked[2]); Theoretical Analysis and Scaling investigates convergence guarantees, scaling laws, and compute-optimal trade-offs (Scaling Masked Diffusion[4], No Compute Left[19]); and Application-Specific Architectures tailors these models to domains such as motion synthesis (Masked Motion Model[3]), medical imaging (Unified Multi-modal MRI[21]), and recommendation systems (Masked Diffusion Recommendation[27]).

A particularly active line of work focuses on inference-time sampling policies, where researchers seek to balance generation quality and speed by optimizing unmasking schedules. Some studies propose hand-crafted or theoretically motivated schedules (Dilated Scheduling[5], Optimal Inference Schedules[8]), while others learn adaptive policies from data (Learning Unmasking Policies[6]). Parallel Sampling Conditional[0] sits squarely within this inference-time policy cluster, emphasizing conditional generation scenarios where the unmasking strategy must account for external constraints or guidance.
Compared to nearby works like Dilated Scheduling[5], which focuses on deterministic schedule design, and Optimal Inference Schedules[8], which derives schedules from theoretical principles, Parallel Sampling Conditional[0] appears to prioritize flexible, condition-aware sampling that adapts to task-specific requirements. This positions it as a bridge between fixed scheduling heuristics and fully learned policies, addressing practical deployment challenges where conditioning signals vary widely across applications.

Claimed Contributions

PUNT sampler for parallel token unmasking via contextual independence testing

The authors propose PUNT (Parallel Unmasking with Non-influence Tests), a training-free algorithm that identifies sets of contextually independent tokens for parallel unmasking in masked diffusion models. The method uses a divide-and-conquer strategy with O(log m) model calls per step to test for conditional independence, enabling efficient parallel generation while maintaining quality.

5 retrieved papers
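The divide-and-conquer idea described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `toy_model`, the greedy fill, and the threshold `eps` are our assumptions, and the paper's actual method batches these probes so that each denoising step needs only O(log m) forward passes rather than separate calls.

```python
# Hypothetical sketch of a PUNT-style divide-and-conquer independence test.
MASK = None

def toy_model(seq):
    """Stand-in 'denoiser': P(token = 1) at each masked position.
    A position prefers to copy its left neighbour when it is visible,
    which creates a chain of dependencies between adjacent tokens."""
    probs = {}
    for i, tok in enumerate(seq):
        if tok is not MASK:
            continue
        left = seq[i - 1] if i > 0 else None
        if left is MASK or left is None:
            probs[i] = 0.5            # no visible context: uncertain
        else:
            probs[i] = 0.9 if left == 1 else 0.1
    return probs

def influenced(seq, test_pos, probe_set, eps=0.05):
    """Does tentatively unmasking `probe_set` shift test_pos's distribution?"""
    base = toy_model(seq)[test_pos]
    probed = list(seq)
    for j in probe_set:
        probed[j] = 1 if toy_model(seq)[j] >= 0.5 else 0  # greedy fill
    return abs(toy_model(probed)[test_pos] - base) > eps

def punt_select(seq, candidates):
    """Recursively keep only candidates that are mutually non-influencing:
    split the set, resolve each half, then cross-check the halves."""
    if len(candidates) <= 1:
        return list(candidates)
    mid = len(candidates) // 2
    left = punt_select(seq, candidates[:mid])
    right = punt_select(seq, candidates[mid:])
    # Drop right-half tokens whose distribution shifts once the left
    # half is filled in; the (higher-priority) left half survives.
    safe_right = [p for p in right if not influenced(seq, p, left)]
    return left + safe_right

seq = [1, MASK, MASK, MASK]
print(punt_select(seq, [1, 2, 3]))    # → [1]
```

In this toy chain every masked token depends on its left neighbour, so only the first candidate survives and the rest are deferred to later denoising steps, which is the behaviour the contribution describes.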
Efficient recursive algorithm with binary encoding for independence testing

The authors develop an efficient iterative implementation of their recursive independence testing procedure using binary encoding of token positions. This transforms the recursive algorithm into a parallel procedure that requires only O(log |M|) forward evaluations per denoising step, where M is the set of masked tokens.

6 retrieved papers
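A rough sketch of why binary encoding yields O(log |M|) rounds: assign each masked position a rank and split the set by the bits of that rank, so any two distinct positions land on opposite sides of some round. The helper name `bit_partitions` is hypothetical, and the batching of the corresponding forward evaluations is elided.

```python
import math

def bit_partitions(masked_positions):
    """For each bit k, split positions by the k-th bit of their rank.
    Any two distinct ranks differ in some bit, so every pair of
    positions is separated (and hence probed) in at least one round."""
    m = len(masked_positions)
    n_bits = max(1, math.ceil(math.log2(m)))
    rounds = []
    for k in range(n_bits):
        probe = [p for r, p in enumerate(masked_positions) if (r >> k) & 1]
        keep = [p for r, p in enumerate(masked_positions) if not (r >> k) & 1]
        rounds.append((probe, keep))
    return rounds

positions = [4, 7, 9, 12, 15]          # example masked indices
for probe, keep in bit_partitions(positions):
    print(probe, keep)
```

With five masked tokens this produces ceil(log2 5) = 3 rounds, each of which could be served by a single batched forward pass, matching the O(log |M|) evaluation count claimed for the contribution.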
Contextual independence criterion for safe parallel unmasking

The authors formalize contextual independence (Definition 3.1 and 3.2) as the theoretical criterion for determining which tokens can be safely unmasked in parallel. Unlike full statistical independence or confidence-based heuristics, this criterion identifies tokens whose conditional distributions remain unchanged given the current context, ensuring parallel sampling matches sequential sampling.

4 retrieved papers
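The criterion can be illustrated numerically: a token is safe to unmask in parallel if its conditional distribution is (near-)unchanged after tentatively revealing the other tokens. This is a hedged paraphrase of Definitions 3.1/3.2, not the paper's exact test; the tolerance `eps` and the choice of total variation distance are our assumptions.

```python
def tv_distance(p, q):
    """Total variation distance between two categorical distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def contextually_independent(p_base, p_probed, eps=1e-3):
    """Token i may be unmasked alongside the probed set if revealing
    that set leaves its conditional distribution (near-)unchanged."""
    return tv_distance(p_base, p_probed) <= eps

# p(x_i | context) before and after tentatively revealing other tokens:
p_before = [0.70, 0.20, 0.10]
p_after_indep = [0.70, 0.20, 0.10]   # unchanged -> safe to parallelize
p_after_dep = [0.10, 0.20, 0.70]     # shifted   -> defer this token

print(contextually_independent(p_before, p_after_indep))  # True
print(contextually_independent(p_before, p_after_dep))    # False
```

This is weaker than full statistical independence, which matches the contribution's point: only the conditional distributions given the current context need to coincide for parallel sampling to match sequential sampling.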

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: PUNT sampler for parallel token unmasking via contextual independence testing

Contribution 2: Efficient recursive algorithm with binary encoding for independence testing

Contribution 3: Contextual independence criterion for safe parallel unmasking