Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: diffusion language model, deletion-insertion process, denoising score entropy
Abstract:

While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID), which rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: computation on non-informative 1) <MASK> tokens inherent to the masking paradigm, and 2) <PAD> tokens introduced in variable-length settings. Furthermore, DID offers greater flexibility: 1) it natively supports variable-length sequences without requiring fixed-length padding, and 2) its insertion operations dynamically adjust token positions, providing an intrinsic self-correction mechanism during generation. To train DID, we design a score-based approach that assigns scores to token insertion operations and derive appropriate training objectives. The objectives involve subsequence counting problems, which we solve efficiently via a parallelized dynamic programming algorithm. Our experiments across fixed- and variable-length settings demonstrate the advantage of DID over MDLM baselines and existing insertion-based LMs in terms of modeling performance, sampling quality, and training/inference speed.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Deletion-Insertion Diffusion (DID) language models that formulate token deletion and insertion as discrete diffusion processes, replacing masking paradigms in existing masked diffusion language models. According to the taxonomy, this work resides in the 'Deletion-Insertion Process Formulations' leaf under 'Core Deletion-Insertion Diffusion Frameworks'. Notably, this leaf contains only the original paper itself with zero sibling papers, indicating a relatively sparse research direction within the broader discrete diffusion landscape of thirteen total papers across eleven leaf nodes.

The taxonomy reveals that the broader field organizes around masking-based approaches (with three distinct subtopics including generalized, conditional, and sparse variants) versus explicit deletion-insertion frameworks. The original paper's leaf sits alongside three other leaves in the core frameworks branch: edit-based reconstruction, general insertion-deletion corruption, and continuous-time Markov chain formulations. The scope note explicitly excludes masking-based approaches and edit-based methods without formal diffusion formulation, positioning DID as pursuing rigorous mathematical foundations for deletion-insertion dynamics distinct from neighboring paradigms.

Among the ten candidates examined for the simplified DICE objective contribution, none was identified as refutable; all ten were classified as non-refutable or unclear. The other two contributions, the core DID framework and the DISE training objective, had zero candidates examined, suggesting the literature search focused primarily on training methodology rather than the fundamental deletion-insertion formulation. Given the limited search scope of ten candidates total and the sparse taxonomy leaf (no siblings), the analysis provides initial signals but cannot comprehensively assess novelty across the full discrete diffusion literature.

Based on top-ten semantic matches, the work appears to occupy a distinct position within discrete diffusion language modeling, though the small candidate pool and absence of sibling papers in the taxonomy limit definitive conclusions. The analysis captures immediate neighborhood relationships but does not exhaustively cover all potential overlaps with masking-based methods or continuous-time formulations in adjacent taxonomy branches.

Taxonomy

Core-task Taxonomy Papers: 13
Claimed Contributions: 3
Contribution Candidate Papers Compared: 10
Refutable Papers: 0

Research Landscape Overview

Core task: discrete diffusion language modeling via deletion and insertion processes. The field has organized itself around several complementary directions. At the foundation lie core deletion-insertion diffusion frameworks that directly model text generation through iterative removal and addition of tokens, exemplified by works such as Deletion Insertion Diffusion[0] and Insertion Deletion Denoising[5]. A parallel branch focuses on masking-based discrete diffusion, which treats masked tokens as a special corruption mechanism rather than explicit deletions. Multimodal and vision-language discrete diffusion extends these ideas beyond pure text, incorporating image or cross-modal conditioning. Applications and specialized domains explore how these generative processes can be adapted to tasks like stylized translation or watermarking, while survey and review literature provides broader perspectives on the landscape.

Within the core deletion-insertion frameworks, a handful of works have explored different formulations of the corruption and denoising processes. Some emphasize curriculum-based strategies for controlling generation complexity, as seen in Curriculum Stylized Translation[3], while others investigate generalized interpolation schemes that unify deletion and insertion with other discrete transitions, such as Generalized Interpolating Diffusion[7]. Deletion Insertion Diffusion[0] sits squarely in this foundational branch, proposing a principled formulation of how tokens are iteratively deleted and reinserted during the diffusion process. Compared to earlier insertion-based models like Insertion Deletion Denoising[5], which laid groundwork for non-autoregressive generation, the original paper refines the theoretical underpinnings and explores richer dynamics between deletion and insertion steps. This positioning highlights ongoing efforts to balance tractability, expressiveness, and sample quality in discrete diffusion for language.

Claimed Contributions

Deletion-Insertion Diffusion language models (DID)

The authors introduce DID, a novel discrete diffusion paradigm that replaces masking-unmasking in MDLMs with deletion-insertion processes. This eliminates <MASK> and <PAD> tokens, improving computational efficiency and enabling native variable-length sequence support with intrinsic self-correction during generation.

0 retrieved papers
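The forward (corruption) side of such a deletion process can be illustrated with a minimal sketch. The function name `delete_forward` and the parameter `keep_prob` are hypothetical; this is only a toy illustration of order-preserving deletion, not the paper's actual noise schedule.

```python
import random

def delete_forward(x0, keep_prob):
    """Toy forward corruption for a deletion-based diffusion step.

    Each token survives independently with probability keep_prob;
    survivors keep their relative order, so the corrupted sequence
    is a subsequence of x0. Note that no <MASK> placeholders or
    <PAD> tokens are needed: the sequence simply shrinks.
    """
    return [tok for tok in x0 if random.random() < keep_prob]
```

In contrast, a masking-based corruption of the same sequence would keep its length fixed and replace deleted positions with `<MASK>`, which is exactly the overhead the deletion formulation avoids.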
Denoising Insertion Score Entropy (DISE) training objective

The authors develop DISE, a score-based training objective for learning DID's insertion process. They define an insertion score modeling the probability of inserting any token at any position, derive the DISE objective involving subsequence count ratios, and provide an efficient parallelized dynamic programming algorithm to compute these ratios.

0 retrieved papers
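The paper's parallelized algorithm is not reproduced in this report. As a point of reference, the underlying counting problem (how many times a corrupted sequence occurs as a subsequence of the clean sequence) admits a standard serial dynamic program, sketched below under the hypothetical name `count_subsequences`; the paper's contribution is an efficient parallelization of ratios of such counts, which this sketch does not attempt.

```python
def count_subsequences(s, t):
    """Count the occurrences of t as a (not necessarily contiguous)
    subsequence of s, via the classic O(len(s) * len(t)) DP.

    dp[j] = number of ways t[:j] matches a subsequence of the
    prefix of s processed so far.
    """
    dp = [0] * (len(t) + 1)
    dp[0] = 1  # the empty subsequence matches exactly once
    for ch in s:
        # iterate j in descending order so each character of s
        # extends matches computed before it was seen
        for j in range(len(t), 0, -1):
            if t[j - 1] == ch:
                dp[j] += dp[j - 1]
    return dp[len(t)]
```

For example, `count_subsequences("rabbbit", "rabbit")` is 3, since there are three choices of which two of the three b's to keep.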
Simplified DICE objective for fixed-length settings

For fixed-length data, the authors show that the insertion score becomes time-independent and satisfies a sequence-level normalization property. This enables a simplified Denoising Insertion Cross Entropy (DICE) objective that improves parameterization and learning efficiency in fixed-length language modeling benchmarks.

10 retrieved papers
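Assuming the sequence-level normalization described above, the simplified objective plausibly reduces to a cross entropy over (gap, token) insertion events. The sketch below, with hypothetical names `dice_loss`, `logits`, and `targets`, shows one way such a jointly normalized cross entropy could be computed; it is an illustrative guess at the shape of the objective, not the paper's definition.

```python
import numpy as np

def dice_loss(logits, targets):
    """Toy cross entropy with joint normalization over insertions.

    logits:  array of shape (num_gaps, vocab), unnormalized scores
             for inserting each vocabulary token at each gap of the
             corrupted sequence.
    targets: list of (gap_index, token_id) pairs identifying the
             deleted tokens to be re-inserted.

    The softmax is taken jointly over all (gap, token) pairs,
    reflecting a sequence-level normalization property.
    """
    z = logits - logits.max()                      # numerical stability
    log_probs = z - np.log(np.exp(z).sum())        # joint log-softmax
    return -sum(log_probs[g, v] for g, v in targets) / len(targets)
```

With uniform scores over 2 gaps and a 3-token vocabulary, every insertion event has probability 1/6, so the loss for any single target is log 6.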

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Deletion-Insertion Diffusion language models (DID)

The authors introduce DID, a novel discrete diffusion paradigm that replaces masking-unmasking in MDLMs with deletion-insertion processes. This eliminates <MASK> and <PAD> tokens, improving computational efficiency and enabling native variable-length sequence support with intrinsic self-correction during generation.

Contribution

Denoising Insertion Score Entropy (DISE) training objective

The authors develop DISE, a score-based training objective for learning DID's insertion process. They define an insertion score modeling the probability of inserting any token at any position, derive the DISE objective involving subsequence count ratios, and provide an efficient parallelized dynamic programming algorithm to compute these ratios.

Contribution

Simplified DICE objective for fixed-length settings

For fixed-length data, the authors show that the insertion score becomes time-independent and satisfies a sequence-level normalization property. This enables a simplified Denoising Insertion Cross Entropy (DICE) objective that improves parameterization and learning efficiency in fixed-length language modeling benchmarks.