Autoregressive Image Generation with Randomized Parallel Decoding

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: autoregressive image generation, parallel decoding, next-token prediction
Abstract:

We introduce ARPG, a novel visual Autoregressive model that enables Randomized Parallel Generation, addressing the inherent limitations of conventional raster-order approaches, whose sequential, predefined token generation order hinders inference efficiency and zero-shot generalization. Our key insight is that effective random-order modeling requires explicit guidance for determining the position of the next predicted token. To this end, we propose a decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image in-painting, out-painting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256×256 benchmark, our approach attains an FID of 1.83 with only 32 sampling steps, achieving over a 30× speedup in inference and a 75% reduction in memory consumption compared to representative recent autoregressive models at a similar scale.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 31
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 2

Research Landscape Overview

Core task: autoregressive image generation with randomized parallel decoding. The field of autoregressive image generation has evolved beyond strictly sequential token prediction, exploring diverse strategies to accelerate inference while preserving or improving generation quality. The taxonomy reveals several major branches: some methods exploit spatial structure to decode multiple tokens in parallel (e.g., Zipar Spatial Locality[2], Parallelized Autoregressive Visual[1]), while others relax the fixed raster-scan order entirely, training models on random or flexible orderings (e.g., RandAR Random Orders[4], Randomized Autoregressive[11]). A third line of work borrows ideas from speculative execution and iterative refinement (Speculative Jacobi[14], Grouped Speculative[15]), and a fourth branch investigates masked or bidirectional transformers that predict multiple tokens simultaneously (Maskgit[5], Muse[10]). Additional branches address coarse-to-fine hierarchies (Cogview2[12], DetailFlow[9]), hybrid paradigms blending autoregressive and diffusion-like dynamics (Discrete Absorbing Diffusion[16]), and application-specific extensions such as multilingual or cross-domain generation.

Among these directions, random-order and flexible-order modeling has attracted growing interest as a way to enable parallel decoding without committing to a single spatial traversal. Randomized Parallel Decoding[0] sits squarely in this branch, training on randomized orderings augmented with position guidance to allow the model to decode multiple tokens concurrently at inference time. This approach contrasts with purely spatial methods like Zipar Spatial Locality[2], which rely on fixed locality patterns, and with speculative techniques like Superposed Decoding[3], which draft and verify tokens in parallel but do not randomize training order.

Compared to RandAR Random Orders[4], which also explores random-order training, Randomized Parallel Decoding[0] emphasizes the integration of position cues to steer parallel generation. The central trade-off across these branches remains balancing inference speed, sample quality, and training complexity, with random-order methods offering a flexible middle ground between fully sequential and fully parallel paradigms.
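The random-order training idea discussed above can be illustrated in plain Python. This is a hypothetical sketch, not any paper's actual pipeline (the function name and data layout are our own): token positions are visited in a random permutation, and each prediction step is explicitly told which position comes next — the positional guidance that random-order modeling requires.

```python
import random

def make_random_order_example(tokens, seed=None):
    """Build one random-order training example (illustrative only)."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # random generation order instead of raster scan
    inputs, targets = [], []
    for t in range(len(order)):
        # Context: (position, token) pairs already "generated" in this order.
        context = [(p, tokens[p]) for p in order[:t]]
        # The model is told WHERE to predict next (order[t]) -- the explicit
        # positional guidance that random-order modeling needs.
        inputs.append((context, order[t]))
        targets.append(tokens[order[t]])
    return inputs, targets

inputs, targets = make_random_order_example([10, 11, 12, 13], seed=0)
```

Each training step thus conditions on an arbitrary subset of (position, token) pairs rather than a fixed raster prefix, which is what lets the model later decode positions in any order.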

Claimed Contributions

ARPG: Visual autoregressive model with randomized parallel generation

The authors propose ARPG, a visual autoregressive framework that supports fully random-order training and parallel token generation. Unlike conventional raster-order methods, ARPG eliminates sequential constraints and enables efficient inference while maintaining zero-shot generalization capabilities for tasks like inpainting and outpainting.

5 retrieved papers
Can Refute
Decoupled decoding framework with positional guidance

The authors introduce a two-pass decoder architecture that separates content representation learning from position-guided prediction. The first pass uses causal self-attention to build content representations as key-value pairs, while the second pass uses position-aware mask tokens as queries that attend to these representations via causal cross-attention.

7 retrieved papers
Can Refute
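As a rough illustration of the query/key-value split described in this contribution, the pure-Python sketch below computes one step of position-guided attention. All names here are ours, and the real model operates on learned embeddings inside a transformer; the point is only that the query carries positional guidance ("where to predict") while keys and values carry content ("what has been generated").

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def guided_attention(query, keys, values):
    """One cross-attention step of a (hypothetical) decoupled decoder:
    `query` encodes only the target position; `keys`/`values` encode
    the content generated so far."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of value vectors -> prediction context for the target position.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Toy demo: a position query attending over two cached content entries.
q = [1.0, 0.0]
ks = [[1.0, 0.0], [0.0, 1.0]]
vs = [[1.0, 0.0], [0.0, 1.0]]
out = guided_attention(q, ks, vs)
```

Because the query contains no content of its own, swapping in a different target-position query changes what is predicted without touching the cached keys and values — which is what makes random-order and parallel decoding compatible with a causal cache.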
Parallel inference with shared KV cache

The framework enables efficient parallel decoding by allowing multiple position-aware queries to simultaneously attend to a shared key-value cache. This design achieves significant speedups (30× over raster-order models, 3× over recent parallel AR models) while reducing memory consumption by 75% compared to similar-scale methods.

10 retrieved papers
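The shared-KV-cache decoding loop described in this contribution can be sketched as follows. This is a schematic, not the actual implementation: `predict` stands in for the model's forward pass, and the cache here is a plain list rather than real attention tensors.

```python
def parallel_decode(num_tokens, step_size, predict):
    """Hypothetical sketch of parallel decoding with a shared KV cache:
    at each step, `step_size` position queries attend to the same cache,
    and their outputs are appended back as new key-value entries."""
    kv_cache = []      # one cache shared by every query in a step
    generated = {}
    positions = list(range(num_tokens))
    steps = 0
    while positions:
        batch, positions = positions[:step_size], positions[step_size:]
        # All queries in `batch` read the same kv_cache concurrently.
        outputs = [predict(pos, kv_cache) for pos in batch]
        for pos, tok in zip(batch, outputs):
            generated[pos] = tok
            kv_cache.append((pos, tok))  # cache grows once per token, not per query
        steps += 1
    return generated, steps

# Toy stand-in for the model: "predict" a token from its position alone.
generated, steps = parallel_decode(8, 4, lambda pos, cache: pos * 10)
```

Decoding `step_size` tokens per forward pass is what yields the sampling-step reduction the report cites (e.g., 256 tokens in 32 steps at `step_size = 8`), while the single shared cache avoids duplicating memory across concurrent queries.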

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
