Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Efficient Autoregressive Image Generation, Parallel Decoding
Abstract:

We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction, but achieved only limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the generation steps from 256 to 20 (256×256 res.) and from 1024 to 48 (512×512 res.) without compromising quality on ImageNet class-conditional generation, while achieving at least 3.4× lower latency than previous parallelized autoregressive models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 4

Research Landscape Overview

Core task: accelerating autoregressive image generation through parallel decoding. The field addresses the inherent sequential bottleneck of autoregressive models by exploring diverse strategies to predict multiple tokens simultaneously. The taxonomy reveals a rich landscape organized around twelve major branches. Spatial Locality-Based Parallel Decoding exploits the natural correlation among neighboring image patches to enable concurrent predictions, as seen in works like Zipar[1] and Parallelized Autoregressive Visual[2]. Hierarchical and Multi-Scale Autoregressive Modeling decomposes generation into coarse-to-fine stages, while Block-Based and Semi-Autoregressive Decoding groups tokens into chunks for batch processing. Random-Order and Flexible-Order approaches relax strict raster-scan dependencies, and Speculative and Iterative Parallel Decoding methods draft multiple candidates in parallel before verification. Masked and Non-Autoregressive branches draw inspiration from diffusion and masked language models, whereas Retrieval-Augmented and Context-Aware Generation approaches incorporate external knowledge. Additional branches cover variational latent models, bidirectional architectures, system-level optimizations, domain-specific extensions, and theoretical foundations, reflecting the breadth of innovation in this space.

Several active lines of work highlight contrasting trade-offs between generation quality, speed, and architectural complexity. Spatial locality methods such as Neighboring Autoregressive Modeling[4] and Next Block Prediction[5] achieve strong speedups by predicting contiguous regions, yet must carefully balance parallelism with maintaining coherence across boundaries. Speculative techniques and iterative refinement offer flexible acceleration but introduce verification overhead.
Locality Parallel Decoding[0] sits within the Spatial Locality-Based branch under Flexible Parallelized Autoregressive Modeling, emphasizing adaptive parallel prediction guided by local dependencies. Compared to neighbors like Parallelized Autoregressive Visual[2], which also leverages spatial structure, Locality Parallel Decoding[0] appears to focus on dynamic locality-aware scheduling rather than fixed block partitioning. This positioning reflects a broader trend toward flexible, content-adaptive parallelization strategies that aim to preserve autoregressive quality while unlocking substantial inference speedups.

Claimed Contributions

Flexible Parallelized Autoregressive Modeling

The authors introduce a novel architecture that decouples context representation from token generation by using learnable position query tokens. This design enables arbitrary generation order and degrees of parallelization while maintaining mutual visibility among concurrently generated tokens through specialized attention mechanisms, and inherits KV caching to avoid redundant computation.

Retrieved papers: 10 · Can Refute
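The described attention pattern can be made concrete with a small sketch. This is a hypothetical reconstruction, not the authors' implementation: `step_attention_mask` is an assumed helper that builds the boolean attention mask for one parallel decoding step, in which cached context tokens attend causally among themselves (so KV caching still applies) while the position query tokens attend to all context tokens and to each other (mutual visibility).

```python
def step_attention_mask(n_ctx: int, n_query: int) -> list[list[bool]]:
    """Attention mask for one parallel decoding step (True = may attend).

    Hypothetical sketch: the first n_ctx rows are already-generated
    context tokens served from the KV cache (causal among themselves);
    the last n_query rows are learnable position query tokens decoded
    concurrently, with full visibility of context and of each other.
    """
    n = n_ctx + n_query
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_ctx:
                mask[i][j] = j <= i   # causal among cached context
            else:
                mask[i][j] = True     # queries see everything, incl. each other
    return mask
```

Because context rows remain strictly causal, their keys and values never change across steps, which is why the architecture can inherit KV caching as the contribution claims.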
Locality-aware Generation Ordering

The authors propose a generation order schedule guided by two principles: selecting target positions spatially close to existing context for strong conditioning, and ensuring concurrently generated tokens are spatially distant to reduce mutual dependency. This schedule leverages spatial locality patterns observed in autoregressive image generation attention.

Retrieved papers: 7
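The two ordering principles lend themselves to a greedy sketch. The following is a minimal illustrative version, not the authors' algorithm: `locality_schedule` (a hypothetical name) and its `min_sep` threshold are assumptions. At each step it ranks remaining positions by distance to the nearest already-generated token (closer means stronger conditioning) and admits a position into the current group only if it stays far from every position already chosen for that group (reducing intra-group dependency).

```python
import math

def locality_schedule(h, w, group_sizes, min_sep=2.0):
    """Greedy sketch of a locality-aware generation order.

    Principle 1: prefer positions spatially close to existing context.
    Principle 2: positions generated concurrently should be mutually
    distant (at least min_sep apart); if that is infeasible, fall back
    to filling the group so every position is eventually scheduled.
    """
    remaining = {(r, c) for r in range(h) for c in range(w)}
    generated, schedule = [], []
    for k in group_sizes:
        def ctx_dist(p):
            return min(math.dist(p, q) for q in generated) if generated else 0.0
        group = []
        for p in sorted(remaining, key=ctx_dist):       # closest to context first
            if len(group) == k:
                break
            if all(math.dist(p, q) >= min_sep for q in group):
                group.append(p)
        for p in sorted(remaining, key=ctx_dist):       # fallback fill
            if len(group) == k:
                break
            if p not in group:
                group.append(p)
        for p in group:
            remaining.discard(p)
        generated.extend(group)
        schedule.append(group)
    return schedule
```

On a 4×4 grid with group sizes [1, 3, 4, 8], the schedule covers all 16 positions exactly once while spreading each group across the grid.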
Locality-aware Parallel Decoding Framework

The authors present a complete framework combining flexible parallelized autoregressive modeling with locality-aware generation ordering to significantly reduce generation steps (from 256 to 20 for 256×256 resolution and 1024 to 48 for 512×512 resolution) while maintaining generation quality and achieving at least 3.4× lower latency than previous parallelized autoregressive models.

Retrieved papers: 9 · Can Refute
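The step-count reduction implies a per-step group-size schedule whose sizes sum to the token count (256 tokens in 20 steps, 1024 in 48). The sketch below is an illustrative assumption, not the authors' schedule: `group_size_schedule` decodes single tokens during a short warm-up (when little context exists) and then splits the remaining tokens as evenly as possible across the remaining steps.

```python
def group_size_schedule(n_tokens, n_steps, warmup=4):
    """Hypothetical parallelism schedule: `warmup` single-token steps,
    then the remaining tokens divided near-evenly over the remaining
    steps. Sizes always sum to n_tokens in exactly n_steps steps."""
    assert n_steps > warmup and n_tokens >= n_steps
    sizes = [1] * warmup
    rest, steps = n_tokens - warmup, n_steps - warmup
    base, extra = divmod(rest, steps)
    sizes += [base + (1 if i < extra else 0) for i in range(steps)]
    return sizes
```

For example, 256 tokens in 20 steps yields four warm-up steps followed by groups of roughly 15-16 tokens, consistent with the 256→20 reduction the framework reports.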

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Flexible Parallelized Autoregressive Modeling

The authors introduce a novel architecture that decouples context representation from token generation by using learnable position query tokens. This design enables arbitrary generation order and degrees of parallelization while maintaining mutual visibility among concurrently generated tokens through specialized attention mechanisms, and inherits KV caching to avoid redundant computation.

Contribution

Locality-aware Generation Ordering

The authors propose a generation order schedule guided by two principles: selecting target positions spatially close to existing context for strong conditioning, and ensuring concurrently generated tokens are spatially distant to reduce mutual dependency. This schedule leverages spatial locality patterns observed in autoregressive image generation attention.

Contribution

Locality-aware Parallel Decoding Framework

The authors present a complete framework combining flexible parallelized autoregressive modeling with locality-aware generation ordering to significantly reduce generation steps (from 256 to 20 for 256×256 resolution and 1024 to 48 for 512×512 resolution) while maintaining generation quality and achieving at least 3.4× lower latency than previous parallelized autoregressive models.