Autoregressive Image Generation with Randomized Parallel Decoding
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose ARPG, a visual autoregressive framework that supports fully random-order training and parallel token generation. Unlike conventional raster-order methods, ARPG eliminates sequential constraints and enables efficient inference while maintaining zero-shot generalization capabilities for tasks like inpainting and outpainting.
The authors introduce a two-pass decoder architecture that separates content representation learning from position-guided prediction. The first pass uses causal self-attention to build content representations as key-value pairs, while the second pass uses position-aware mask tokens as queries that attend to these representations via causal cross-attention.
The framework enables efficient parallel decoding by allowing multiple position-aware queries to simultaneously attend to a shared key-value cache. This design achieves significant speedups (30× over raster-order models, 3× over recent parallel AR models) while reducing memory consumption by 75% compared to similar-scale methods.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
[31] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders Supplementary Material
Contribution Analysis
Detailed comparisons for each claimed contribution
ARPG: Visual autoregressive model with randomized parallel generation
The authors propose ARPG, a visual autoregressive framework that supports fully random-order training and parallel token generation. Unlike conventional raster-order methods, ARPG eliminates sequential constraints and enables efficient inference while maintaining zero-shot generalization capabilities for tasks like inpainting and outpainting.
[4] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
[29] Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
[39] σ-GPTs: A New Approach to Autoregressive Models
[40] Diffusion-based Large Language Models Survey
[41] Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding
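The random-order training described for this contribution can be illustrated with a short sketch: sample a random permutation of token positions, feed the already-decoded prefix as input, and predict each remaining token at its permuted position. The function and variable names below are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def random_order_batch(tokens, rng):
    """Sketch of random-order training data construction.

    tokens: 1-D array of image-token ids for one sample.
    Returns the sampled generation order, the input prefix tokens,
    the positions to predict, and the ground-truth targets.
    """
    n = len(tokens)
    order = rng.permutation(n)       # a fresh random generation order
    inputs = tokens[order[:-1]]      # tokens seen so far, in that order
    target_pos = order[1:]           # which position to predict next
    targets = tokens[order[1:]]      # ground-truth token at that position
    return order, inputs, target_pos, targets
```

Because the order is resampled per example, the model is never tied to a raster scan, which is what enables zero-shot inpainting and outpainting: at inference the "order" can simply be chosen so that known tokens come first.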
Decoupled decoding framework with positional guidance
The authors introduce a two-pass decoder architecture that separates content representation learning from position-guided prediction. The first pass uses causal self-attention to build content representations as key-value pairs, while the second pass uses position-aware mask tokens as queries that attend to these representations via causal cross-attention.
[38] Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
[32] Design of a Modified Transformer Architecture Based on Relative Position Coding
[33] Architectural entanglement via sequential convergence anchors: A novel framework for latent synchronization in large language models
[34] Disentangled sequential autoencoder
[35] IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction
[36] Generative temporal models with spatial memory for partially observed environments
[37] SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
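The decoupled two-pass attention described for this contribution can be sketched as follows: content tokens are projected once into key-value pairs, and position-aware queries then attend to them. This single-head, unmasked numpy version is a minimal sketch under assumed weight names (`Wk`, `Wv`, `Wq`); the actual model uses causal masking and multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_pass_decode(content, pos_queries, Wk, Wv, Wq):
    """Sketch of the decoupled decoder: pass 1 turns content tokens
    into key-value pairs; pass 2 lets position-aware mask-token
    queries attend to them to predict tokens at target positions.
    Causal masking is omitted here for brevity.
    """
    K = content @ Wk        # (n, d) keys built from content tokens
    V = content @ Wv        # (n, d) values built from content tokens
    Q = pos_queries @ Wq    # (m, d) queries carrying target positions
    d = K.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))   # (m, n) cross-attention
    return attn @ V         # position-guided predictions
```

The key property is that the queries carry only positional information, so "what the model knows" (keys/values) is cleanly separated from "where it should predict next" (queries).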
Parallel inference with shared KV cache
The framework enables efficient parallel decoding by allowing multiple position-aware queries to simultaneously attend to a shared key-value cache. This design achieves significant speedups (30× over raster-order models, 3× over recent parallel AR models) while reducing memory consumption by 75% compared to similar-scale methods.
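The shared-cache mechanism behind this speedup can be sketched in a few lines: at each step, a batch of position queries reads the same cached keys and values, so k tokens are decoded per forward pass while the cache grows once per step rather than once per query. The class and method names below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedKVCache:
    """Minimal sketch of parallel decoding against one shared KV cache."""

    def __init__(self, K, V):
        self.K = K          # (n, d) cached keys from decoded content
        self.V = V          # (n, d) cached values from decoded content

    def decode_step(self, queries, new_K, new_V):
        """Decode several tokens at once: every query in this step
        attends to the same cached keys/values (single unmasked head,
        for brevity)."""
        d = queries.shape[-1]
        attn = softmax(queries @ self.K.T / np.sqrt(d))  # (m, n)
        out = attn @ self.V                              # (m, d)
        # The step's new key-value pairs are appended once and shared
        # by all future queries, so memory grows per step, not per query.
        self.K = np.concatenate([self.K, new_K])
        self.V = np.concatenate([self.V, new_V])
        return out
```

Because the m queries in one step are independent given the cache, they can be batched into a single matrix multiply, which is the source of the reported parallel-decoding speedup.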