Partition Generative Modeling: Masked Modeling Without Masks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: masked generative modeling, discrete diffusion, masked diffusion language modeling, diffusion language modeling
Abstract:

Masked generative models (MGMs) are widely used to capture complex data and enable faster generation than autoregressive (AR) models through parallel decoding. However, MGMs typically operate on fixed-length inputs, which can be inefficient: early in sampling, most tokens are masked and carry little information, leading to wasted computation. In contrast, AR models process only previously generated tokens, making early iterations faster. In this work, we introduce the "Partition Generative Model" (PGM), a novel approach that combines the strengths of AR models and MGMs. Rather than masking, PGM partitions tokens into two groups and employs sparse attention to block information flow between them. Since there is no information flow between partitions, the model can process only the previously generated tokens during sampling, while retaining the ability to generate tokens in parallel and in any order. On OpenWebText, PGMs offer at least 5× improvements in sampling latency and throughput, while producing samples with superior generative perplexity, compared to Masked Diffusion Language Models. On ImageNet, PGMs achieve up to 7× better throughput than MaskGIT, with only a small change in FID. Finally, we show that PGMs are compatible with distillation methods for MGMs, enabling further inference speedups.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Partition Generative Models (PGM), which partition tokens into groups and use sparse attention to block information flow between partitions, enabling parallel generation without masking. According to the taxonomy, this work resides in the 'Partition-Based Generative Modeling' leaf under 'Token-Level Parallelization'. Notably, this leaf contains only the original paper itself—no sibling papers are listed—suggesting this specific approach to partition-based generation with sparse attention blocking represents a relatively unexplored direction within the broader token-level parallelization landscape.

The taxonomy reveals that PGM sits within a moderately populated parent branch ('Token-Level Parallelization') containing three leaves: Spatial Locality Exploitation (visual tokens), Parallel Token Prediction (joint multi-token prediction), and the original paper's leaf. Neighboring branches include Speculative Decoding (five leaves, draft-verify pipelines) and Attention Mechanism Optimization (three leaves, memory-efficient and sparse attention). The scope notes clarify that PGM differs from masked diffusion methods (excluded from this leaf) and from speculative approaches that use separate draft models. This positioning indicates PGM occupies a niche between pure autoregressive methods and masked generative models, leveraging partitioning rather than speculation or masking.

Among seventeen candidates examined across three contributions, no refutable prior work was identified. The core PGM contribution examined eight candidates with zero refutations, while the distillation compatibility contribution examined nine candidates, also with zero refutations. The encoder-decoder architecture contribution examined no candidates. Given the limited search scope (seventeen papers from semantic search and citation expansion, not exhaustive), these statistics suggest that within the examined literature, no directly overlapping prior work was found. However, the absence of refutations does not confirm absolute novelty—only that the top-K semantic matches did not reveal clear precedents for partition-based generation with sparse attention blocking.

Based on the limited literature search, PGM appears to introduce a distinctive approach within token-level parallelization, occupying an otherwise unpopulated taxonomy leaf. The lack of sibling papers and zero refutations among seventeen candidates examined suggest the specific combination of partitioning and sparse attention for parallel generation may be novel relative to the surveyed literature. However, the analysis covers only top-K semantic matches and does not exhaustively survey all related work in masked generative models, diffusion methods, or parallel decoding strategies.

Taxonomy

Core-task Taxonomy Papers: 33
Claimed Contributions: 3
Contribution Candidate Papers Compared: 17
Refutable Papers: 0

Research Landscape Overview

Core task: efficient parallel token generation through partitioning. The field addresses the challenge of accelerating sequence generation by dividing computational workloads across multiple processing units or temporal stages. The taxonomy reveals several complementary strategies: Attention Mechanism Optimization focuses on reducing the quadratic complexity of self-attention operations (e.g., Flashattention-2[1], Scaling Transformer Inference[2]), while Token-Level Parallelization explores methods that generate multiple tokens simultaneously rather than strictly sequentially. Speculative Decoding introduces draft-and-verify pipelines to overlap computation, and Model Distribution Strategies partition large models across devices or memory hierarchies (e.g., LLM Edge Partitioning[7], Topology Sculptor[6]). Hardware Architectures and Domain-Specific Applications tailor these ideas to specialized processors or tasks such as video coding (VVC Intra Coding[4], CABAC Partitioning[23]), while General Parallel Processing Frameworks provide foundational concurrency primitives (Coordination-Free Queues[16], Parallel Event Processing[31]).

Within Token-Level Parallelization, a particularly active line of work investigates partition-based generative modeling, where the sequence is divided into chunks that can be processed in parallel or with reduced dependencies. Partition Generative Modeling[0] exemplifies this approach by structuring generation around explicit partitioning schemes, aiming to balance parallelism with the need to maintain coherent cross-partition dependencies. This contrasts with speculative methods like Easyspec[8] or Collaborative Speculative Decoding[20], which rely on draft models to predict multiple tokens speculatively, and with attention-centric optimizations like Flashattention-2[1] that accelerate existing autoregressive pipelines without altering the generation order.

Nearby works such as ZipAR[3] and Parallel Token Prediction[19] also explore token-level parallelism but differ in how they handle inter-token dependencies and verification overhead. The original paper sits squarely in this partition-based cluster, emphasizing structured decomposition over speculative guessing or purely hardware-level acceleration.

Claimed Contributions

Partition Generative Model (PGM)

The authors propose a new generative modeling approach that partitions tokens into two groups instead of masking them. This design allows the model to process only unmasked tokens during sampling while retaining parallel generation capabilities, combining advantages of autoregressive and masked generative models.

8 retrieved papers
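The partition mechanism described in this contribution can be pictured as a block-sparse attention mask. The sketch below is an illustrative assumption, not the authors' implementation: tokens are assigned to one of two groups, and attention is permitted only within a group, so no information flows across the partition.

```python
import numpy as np

def partition_attention_mask(group_ids: np.ndarray) -> np.ndarray:
    """Return a boolean (seq, seq) mask, True where attention is allowed.

    Each token may attend only to tokens in its own partition, which
    blocks all information flow between the two groups.
    """
    return group_ids[:, None] == group_ids[None, :]

# Six tokens, split into partitions 0 and 1.
groups = np.array([0, 1, 0, 0, 1, 1])
mask = partition_attention_mask(groups)
# Token 0 (partition 0) may attend to tokens 0, 2, and 3 only.
print(mask[0])
```

Because the mask is symmetric and block-diagonal under any ordering of the two groups, either partition can later be dropped entirely at sampling time without changing the other partition's activations, which is what lets the model process only previously generated tokens.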
Encoder-decoder architecture with group-wise attention

The authors design a specialized transformer architecture featuring an encoder with partition-wise self-attention, a novel GroupSwap layer, and a decoder with cross-attention but no self-attention. This architecture ensures predictions at position i never depend on the token at position i, enabling efficient processing without masked tokens.

0 retrieved papers
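One way to picture the constraint that predictions at position i never depend on the token at position i is a cross-attention mask in which each decoder query attends only to encoder states from the opposite partition. This is an assumption-laden sketch; the GroupSwap layer and the paper's exact masking scheme are not reproduced here.

```python
import numpy as np

def decoder_cross_attention_mask(group_ids: np.ndarray) -> np.ndarray:
    """Boolean (seq, seq) mask for decoder cross-attention.

    The query predicting position i attends only to encoder states from
    the opposite partition. Since position i lies in the query's own
    partition, the diagonal is always False: the prediction at i can
    never see the token at i.
    """
    return group_ids[:, None] != group_ids[None, :]

groups = np.array([0, 1, 0, 0, 1, 1])
mask = decoder_cross_attention_mask(groups)
assert not mask.diagonal().any()  # no position attends to itself
```

Note that the diagonal is excluded automatically rather than by an explicit ban, which is consistent with the claim that the architecture enforces the leave-one-out property structurally instead of via mask tokens.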
Compatibility with distillation methods for MGMs

The authors demonstrate that PGMs can be combined with existing distillation algorithms designed for masked generative models (specifically Self-Distillation Through Time), preserving performance on downstream tasks while achieving additional speedups in inference.

9 retrieved papers
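The distillation setup can be illustrated with a generic step-matching objective in the spirit of Self-Distillation Through Time: the student's single-step output distribution is trained to match the teacher's distribution after two sampling steps, halving the number of iterations per distillation round. The function names and the exact KL-based loss below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    # Numerically stable log-softmax over the vocabulary axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def step_matching_loss(student_logits: np.ndarray,
                       teacher_two_step_logits: np.ndarray) -> float:
    """KL(teacher || student), averaged over sequence positions.

    `teacher_two_step_logits` would come from running the teacher for
    two denoising steps; the student learns to reproduce that
    distribution in a single step.
    """
    log_p = log_softmax(teacher_two_step_logits)  # teacher target
    log_q = log_softmax(student_logits)           # student prediction
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 16))   # 4 positions, 16-token vocabulary
s = rng.normal(size=(4, 16))
assert step_matching_loss(t, t) < 1e-9   # perfect match gives zero loss
assert step_matching_loss(s, t) >= 0.0   # KL divergence is non-negative
```

Because the objective only needs teacher and student logits at matching positions, it is agnostic to whether those logits come from a masked model or from a partition-based one, which is consistent with the claimed compatibility.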

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Partition Generative Model (PGM)

The authors propose a new generative modeling approach that partitions tokens into two groups instead of masking them. This design allows the model to process only unmasked tokens during sampling while retaining parallel generation capabilities, combining advantages of autoregressive and masked generative models.

Contribution

Encoder-decoder architecture with group-wise attention

The authors design a specialized transformer architecture featuring an encoder with partition-wise self-attention, a novel GroupSwap layer, and a decoder with cross-attention but no self-attention. This architecture ensures predictions at position i never depend on the token at position i, enabling efficient processing without masked tokens.

Contribution

Compatibility with distillation methods for MGMs

The authors demonstrate that PGMs can be combined with existing distillation algorithms designed for masked generative models (specifically Self-Distillation Through Time), preserving performance on downstream tasks while achieving additional speedups in inference.