Partition Generative Modeling: Masked Modeling Without Masks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: masked generative modeling, discrete diffusion, masked diffusion language modeling, diffusion language modeling
Abstract:

Masked generative models (MGMs) are widely used to capture complex data and enable faster generation than autoregressive (AR) models through parallel decoding. However, MGMs typically operate on fixed-length inputs, which can be inefficient: early in sampling, most tokens are masked and carry little information, leading to wasted computation. In contrast, AR models process only previously generated tokens, making early iterations faster. In this work, we introduce the "Partition Generative Model" (PGM), a novel approach that combines the strengths of AR models and MGMs. Rather than masking, PGM partitions tokens into two groups and employs sparse attention to block information flow between them. Since there is no information flow between partitions, the model can process only the previously generated tokens during sampling, while retaining the ability to generate tokens in parallel and in any order. On OpenWebText, PGMs offer at least 5× improvements in sampling latency and throughput, while producing samples with superior generative perplexity, compared to Masked Diffusion Language Models. On ImageNet, PGMs achieve up to 7× better throughput than MaskGIT, with only a small change in FID. Finally, we show that PGMs are compatible with distillation methods for MGMs, enabling further inference speedups.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Partition Generative Models (PGM), which partition tokens into groups and use sparse attention to block information flow between partitions, enabling parallel generation without masking. According to the taxonomy, this work resides in the 'Partition-Based Generative Modeling' leaf under 'Token-Level Parallelization'. Notably, this leaf contains only the original paper itself—no sibling papers are listed—suggesting this specific approach to partition-based generation with sparse attention blocking represents a relatively unexplored direction within the broader token-level parallelization landscape.

The taxonomy reveals that PGM sits within a moderately populated parent branch ('Token-Level Parallelization') containing three leaves: Spatial Locality Exploitation (visual tokens), Parallel Token Prediction (joint multi-token prediction), and the original paper's leaf. Neighboring branches include Speculative Decoding (five leaves, draft-verify pipelines) and Attention Mechanism Optimization (three leaves, memory-efficient and sparse attention). The scope notes clarify that PGM differs from masked diffusion methods (excluded from this leaf) and from speculative approaches that use separate draft models. This positioning indicates PGM occupies a niche between pure autoregressive methods and masked generative models, leveraging partitioning rather than speculation or masking.

Among seventeen candidates examined across three contributions, no refutable prior work was identified. The core PGM contribution examined eight candidates with zero refutations, while the distillation compatibility contribution examined nine candidates, also with zero refutations. The encoder-decoder architecture contribution examined no candidates. Given the limited search scope (seventeen papers from semantic search and citation expansion, not exhaustive), these statistics suggest that within the examined literature, no directly overlapping prior work was found. However, the absence of refutations does not confirm absolute novelty—only that the top-K semantic matches did not reveal clear precedents for partition-based generation with sparse attention blocking.

Based on the limited literature search, PGM appears to introduce a distinctive approach within token-level parallelization, occupying an otherwise unpopulated taxonomy leaf. The lack of sibling papers and zero refutations among seventeen candidates examined suggest the specific combination of partitioning and sparse attention for parallel generation may be novel relative to the surveyed literature. However, the analysis covers only top-K semantic matches and does not exhaustively survey all related work in masked generative models, diffusion methods, or parallel decoding strategies.

Taxonomy

Core-task Taxonomy Papers: 33
Claimed Contributions: 3
Contribution Candidate Papers Compared: 17
Refutable Papers: 0

Research Landscape Overview

Core task: efficient parallel token generation through partitioning. The field addresses the challenge of accelerating sequence generation by dividing computational workloads across multiple processing units or temporal stages. The taxonomy reveals several complementary strategies: Attention Mechanism Optimization focuses on reducing the quadratic complexity of self-attention operations (e.g., Flashattention-2[1], Scaling Transformer Inference[2]), while Token-Level Parallelization explores methods that generate multiple tokens simultaneously rather than strictly sequentially. Speculative Decoding introduces draft-and-verify pipelines to overlap computation, and Model Distribution Strategies partition large models across devices or memory hierarchies (e.g., LLM Edge Partitioning[7], Topology Sculptor[6]). Hardware Architectures and Domain-Specific Applications tailor these ideas to specialized processors or tasks such as video coding (VVC Intra Coding[4], CABAC Partitioning[23]), while General Parallel Processing Frameworks provide foundational concurrency primitives (Coordination-Free Queues[16], Parallel Event Processing[31]).

Within Token-Level Parallelization, a particularly active line of work investigates partition-based generative modeling, where the sequence is divided into chunks that can be processed in parallel or with reduced dependencies. Partition Generative Modeling[0] exemplifies this approach by structuring generation around explicit partitioning schemes, aiming to balance parallelism with the need to maintain coherent cross-partition dependencies. This contrasts with speculative methods like Easyspec[8] or Collaborative Speculative Decoding[20], which rely on draft models to predict multiple tokens speculatively, and with attention-centric optimizations like Flashattention-2[1] that accelerate existing autoregressive pipelines without altering the generation order.

Nearby works such as ZipAR[3] and Parallel Token Prediction[19] also explore token-level parallelism but differ in how they handle inter-token dependencies and verification overhead. The original paper sits squarely in this partition-based cluster, emphasizing structured decomposition over speculative guessing or purely hardware-level acceleration.

Claimed Contributions

Partition Generative Model (PGM)

The authors propose a new generative modeling approach that partitions tokens into two groups instead of masking them. This design allows the model to process only unmasked tokens during sampling while retaining parallel generation capabilities, combining advantages of autoregressive and masked generative models.

8 retrieved papers
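The partition mechanism described in this contribution can be pictured as a block-sparse attention mask. The sketch below is an illustrative assumption, not the authors' implementation: tokens are assigned to one of two groups, and attention is permitted only within a group, so no information flows across the partition.

```python
import numpy as np

def partition_attention_mask(group_ids: np.ndarray) -> np.ndarray:
    """Return a boolean (seq, seq) mask, True where attention is allowed.

    Each token may attend only to tokens in its own partition, which
    blocks all information flow between the two groups.
    """
    return group_ids[:, None] == group_ids[None, :]

# Six tokens, split into partitions 0 and 1.
groups = np.array([0, 1, 0, 0, 1, 1])
mask = partition_attention_mask(groups)
# Token 0 (partition 0) may attend to tokens 0, 2, and 3 only.
print(mask[0])
```

Because the mask is symmetric and block-diagonal under any ordering of the two groups, either partition can later be dropped entirely at sampling time without changing the other partition's activations, which is what lets the model process only previously generated tokens.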
Encoder-decoder architecture with group-wise attention

The authors design a specialized transformer architecture featuring an encoder with partition-wise self-attention, a novel GroupSwap layer, and a decoder with cross-attention but no self-attention. This architecture ensures predictions at position i never depend on the token at position i, enabling efficient processing without masked tokens.

0 retrieved papers
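One way to picture the constraint that predictions at position i never depend on the token at position i is a cross-attention mask in which each decoder query attends only to encoder states from the opposite partition. This is an assumption-laden sketch; the GroupSwap layer and the paper's exact masking scheme are not reproduced here.

```python
import numpy as np

def decoder_cross_attention_mask(group_ids: np.ndarray) -> np.ndarray:
    """Boolean (seq, seq) mask for decoder cross-attention.

    The query predicting position i attends only to encoder states from
    the opposite partition. Since position i lies in the query's own
    partition, the diagonal is always False: the prediction at i can
    never see the token at i.
    """
    return group_ids[:, None] != group_ids[None, :]

groups = np.array([0, 1, 0, 0, 1, 1])
mask = decoder_cross_attention_mask(groups)
assert not mask.diagonal().any()  # no position attends to itself
```

Note that the diagonal is excluded automatically rather than by an explicit ban, which is consistent with the claim that the architecture enforces the leave-one-out property structurally instead of via mask tokens.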
Compatibility with distillation methods for MGMs

The authors demonstrate that PGMs can be combined with existing distillation algorithms designed for masked generative models (specifically Self-Distillation Through Time), preserving performance on downstream tasks while achieving additional speedups in inference.

9 retrieved papers
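The distillation setup can be illustrated with a generic step-matching objective in the spirit of Self-Distillation Through Time: the student's single-step output distribution is trained to match the teacher's distribution after two sampling steps, halving the number of iterations per distillation round. The function names and the exact KL-based loss below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    # Numerically stable log-softmax over the vocabulary axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def step_matching_loss(student_logits: np.ndarray,
                       teacher_two_step_logits: np.ndarray) -> float:
    """KL(teacher || student), averaged over sequence positions.

    `teacher_two_step_logits` would come from running the teacher for
    two denoising steps; the student learns to reproduce that
    distribution in a single step.
    """
    log_p = log_softmax(teacher_two_step_logits)  # teacher target
    log_q = log_softmax(student_logits)           # student prediction
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 16))   # 4 positions, 16-token vocabulary
s = rng.normal(size=(4, 16))
assert step_matching_loss(t, t) < 1e-9   # perfect match gives zero loss
assert step_matching_loss(s, t) >= 0.0   # KL divergence is non-negative
```

Because the objective only needs teacher and student logits at matching positions, it is agnostic to whether those logits come from a masked model or from a partition-based one, which is consistent with the claimed compatibility.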

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Partition Generative Model (PGM)

The authors propose a new generative modeling approach that partitions tokens into two groups instead of masking them. This design allows the model to process only unmasked tokens during sampling while retaining parallel generation capabilities, combining advantages of autoregressive and masked generative models.

Contribution

Encoder-decoder architecture with group-wise attention

The authors design a specialized transformer architecture featuring an encoder with partition-wise self-attention, a novel GroupSwap layer, and a decoder with cross-attention but no self-attention. This architecture ensures predictions at position i never depend on the token at position i, enabling efficient processing without masked tokens.

Contribution

Compatibility with distillation methods for MGMs

The authors demonstrate that PGMs can be combined with existing distillation algorithms designed for masked generative models (specifically Self-Distillation Through Time), preserving performance on downstream tasks while achieving additional speedups in inference.