Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: generation then reconstruction, acceleration, masked autoregressive model, image synthesis
Abstract:

Masked Autoregressive (MAR) models promise better efficiency in visual generation than continuous autoregressive (AR) models owing to their ability to generate tokens in parallel, yet their acceleration potential remains constrained by the difficulty of modeling spatially correlated visual tokens within a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation, which establishes the global semantic scaffolding, followed by detail reconstruction, which efficiently completes the remaining tokens. On the assumption that creating an image from scratch is harder than completing an image given its basic structure, GtR achieves acceleration by computing the reconstruction stage quickly while preserving generation quality by computing the generation stage slowly. Moreover, observing that tokens in the detailed regions of an image often carry more semantic information than tokens in smooth regions, we further propose Frequency-Weighted Token Selection (FTS), which allocates a larger computation budget to tokens covering image details, localized via the energy of their high-frequency components. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate a 3.72× speedup on MAR-H while maintaining comparable quality (e.g., FID 1.59, IS 304.4 vs. 1.59, 299.1 for the original), substantially outperforming existing acceleration methods across model scales and generation tasks. Our code is included in the supplementary materials and will be released on GitHub.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a training-free hierarchical sampling strategy for masked autoregressive visual generation, decomposing inference into structure generation followed by detail reconstruction. It resides in the Hierarchical and Selective Sampling leaf under Inference Acceleration Techniques, which contains only two papers including this work. This leaf represents a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that hierarchical two-stage sampling strategies remain underexplored compared to architectural innovations or training paradigm modifications.

The taxonomy reveals that neighboring research directions pursue complementary acceleration goals through different mechanisms. Feature and KV Caching Mechanisms leverage token redundancy to reduce bidirectional attention overhead, while Model Architecture Innovations explore alternative tokenization and prediction paradigms. The Hierarchical and Multi-Scale Generation branch addresses coarse-to-fine decomposition at the architectural level rather than as an inference-time strategy. The paper's approach diverges from caching-based methods by explicitly separating semantic scaffolding from detail completion, positioning it at the intersection of hierarchical decomposition and selective token prioritization.

Among the twenty-six candidates examined, the Generation then Reconstruction strategy overlaps with one prior work among its ten candidates, while Frequency-Weighted Token Selection has one refutable candidate among the six examined. The stage-aware diffusion scheduling mechanism appears more novel, with zero refutable candidates across its ten examined papers. Because the search is limited to top-K semantic matches, these statistics do not reflect exhaustive coverage. The two-stage decomposition concept appears less novel than the frequency-based token weighting and scheduling components, though the specific combination may offer a distinct contribution.

Based on the limited literature search, the work occupies a sparsely populated research direction with only one sibling paper in its taxonomy leaf. The hierarchical sampling strategy shows some prior overlap, while the frequency-weighting and scheduling mechanisms appear more distinctive among the examined candidates. The analysis covers top-K semantic matches and does not claim exhaustive field coverage, leaving open the possibility of additional related work beyond the twenty-six papers examined.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 26
Refutable papers: 2

Research Landscape Overview

Core task: Accelerating masked autoregressive visual generation. The field addresses the computational bottleneck of autoregressive image synthesis by exploring diverse strategies to reduce inference cost while maintaining generation quality. The taxonomy reveals a rich landscape organized around ten major branches. Inference Acceleration Techniques focus on hierarchical and selective sampling methods that skip or prioritize tokens during generation, as seen in LazyMAR[3] and Frequency Aware Autoregressive[22]. Model Architecture Innovations introduce novel backbone designs and token representations, while Training Paradigms and Optimization refine learning objectives and schedules. Hierarchical and Multi-Scale Generation methods like HMAR[9] decompose images across resolutions, and Parallel and Efficient Decoding approaches such as Parallelized Autoregressive Visual[1] enable non-sequential token prediction. Domain-Specific Applications, Conditional and Controllable Generation, and Unified and Multimodal Frameworks address specialized use cases, whereas Foundational Autoregressive Methods anchor the taxonomy with classical techniques like PixelCNN[44] and MaskGIT[17].

A particularly active tension emerges between methods that accelerate inference through selective token generation and those that redesign the generation order or architecture itself. Works like Visual Autoregressive Modeling[8] and Next Patch Prediction[4] explore alternative orderings, while Neighboring Autoregressive Modeling[5] rethinks spatial dependencies. Generation then Reconstruction[0] sits within the Hierarchical and Selective Sampling cluster, sharing conceptual ground with Frequency Aware Autoregressive[22] by prioritizing which tokens to generate first. Unlike LazyMAR[3], which adaptively skips less critical tokens, Generation then Reconstruction[0] emphasizes a two-stage process that separates coarse generation from refinement.

This positions it as a bridge between hierarchical decomposition strategies and selective sampling heuristics, contributing to the broader question of how to optimally allocate computational budget across the token sequence without sacrificing perceptual fidelity.

Claimed Contributions

Generation then Reconstruction (GtR) hierarchical sampling strategy

GtR is a training-free two-stage sampling method for masked autoregressive models. The first stage generates spatially non-adjacent tokens slowly to establish semantic structure, while the second stage rapidly reconstructs remaining tokens in very few steps, achieving acceleration without quality loss.

10 retrieved papers · Can Refute
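The two-stage split described above can be sketched as a token schedule: many slow steps that decode a few spatially non-adjacent tokens each during structure generation, then a handful of fast steps that decode many tokens each during reconstruction. The checkerboard selection rule, the 50/50 token split, and the step counts below are illustrative assumptions, not the paper's exact settings:

```python
def split_evenly(n, k):
    """Distribute n tokens over k sampling steps as evenly as possible."""
    base, rem = divmod(n, k)
    return [base + (i < rem) for i in range(k)]

def gtr_schedule(num_tokens, structure_ratio=0.5,
                 structure_steps=48, recon_steps=8):
    """Two-stage schedule: many slow steps for the structure stage,
    few fast steps for the reconstruction stage."""
    n_struct = round(num_tokens * structure_ratio)
    n_recon = num_tokens - n_struct
    return (split_evenly(n_struct, structure_steps),
            split_evenly(n_recon, recon_steps))

def structure_positions(h, w):
    """Spatially non-adjacent (checkerboard) token positions,
    an illustrative choice for the structure stage."""
    return [(r, c) for r in range(h) for c in range(w) if (r + c) % 2 == 0]

struct_sched, recon_sched = gtr_schedule(256)
print(sum(struct_sched), sum(recon_sched))  # 128 tokens decoded per stage
```

With a 16×16 latent grid (256 tokens), the reconstruction stage decodes 16 tokens per step versus 2-3 during structure generation, which is where the acceleration comes from.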
Frequency-Weighted Token Selection (FTS)

FTS is a training-free strategy that allocates more diffusion steps to tokens with complex details by identifying high-frequency tokens through Fourier transformation of token latents, thereby improving computational efficiency while preserving generation quality.

6 retrieved papers · Can Refute
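The description above amounts to a high-pass filter over the token latent grid followed by a per-token energy score. A minimal sketch under stated assumptions: the cutoff radius, the top-fraction step allocation, and the names `fts_scores`/`allocate_steps` are all hypothetical, not the paper's implementation.

```python
import numpy as np

def fts_scores(latents, cutoff=0.25):
    """Per-token high-frequency energy: Fourier-transform the (H, W, C)
    token latent grid, zero out low frequencies, and measure the energy
    of the high-pass residual at each token position."""
    H, W, _ = latents.shape
    f = np.fft.fftshift(np.fft.fft2(latents, axes=(0, 1)), axes=(0, 1))
    yy, xx = np.mgrid[0:H, 0:W]
    radius = np.sqrt(((yy - H / 2) / H) ** 2 + ((xx - W / 2) / W) ** 2)
    f_high = f * (radius > cutoff)[..., None]      # keep only high frequencies
    residual = np.fft.ifft2(np.fft.ifftshift(f_high, axes=(0, 1)), axes=(0, 1))
    return (np.abs(residual) ** 2).sum(axis=-1)    # (H, W) score per token

def allocate_steps(scores, base_steps=4, extra_steps=12, top_frac=0.25):
    """Grant the top-scoring fraction of tokens extra diffusion steps."""
    flat = scores.ravel()
    k = max(1, int(len(flat) * top_frac))
    thresh = np.partition(flat, -k)[-k]
    return np.where(scores >= thresh, base_steps + extra_steps, base_steps)
```

On a latent grid that is flat everywhere except one textured patch, the patch tokens score highest and therefore receive the larger diffusion-step budget.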
Stage-aware diffusion scheduling mechanism

A diffusion scheduling approach that adapts the number of diffusion steps across generation stages, using linearly decreasing steps during structure generation and fixed steps during reconstruction, reflecting the varying modeling complexity at different stages.

10 retrieved papers
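The scheduling rule above can be sketched as a per-sampling-step diffusion budget; the start, end, and reconstruction step counts below are illustrative placeholders, not the paper's settings:

```python
def stage_aware_steps(structure_steps=48, recon_steps=8,
                      start=100, end=50, recon=25):
    """Per-step diffusion budget: linearly decreasing during structure
    generation, small and fixed during reconstruction."""
    denom = max(structure_steps - 1, 1)
    structure = [round(start + (end - start) * i / denom)
                 for i in range(structure_steps)]
    return structure + [recon] * recon_steps

print(stage_aware_steps()[:3], stage_aware_steps()[-3:])
```

Early structure steps (where single-step modeling is hardest) get the largest budgets, and the budget tapers linearly before dropping to a fixed low count for reconstruction.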

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
