Autoregressive Image Generation with Randomized Parallel Decoding
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose ARPG, a visual autoregressive framework that supports fully random-order training and parallel token generation. Unlike conventional raster-order methods, ARPG eliminates sequential constraints and enables efficient inference while maintaining zero-shot generalization capabilities for tasks like inpainting and outpainting.
The authors introduce a two-pass decoder architecture that separates content representation learning from position-guided prediction. The first pass uses causal self-attention to build content representations as key-value pairs, while the second pass uses position-aware mask tokens as queries that attend to these representations via causal cross-attention.
The framework enables efficient parallel decoding by allowing multiple position-aware queries to simultaneously attend to a shared key-value cache. This design achieves significant speedups (30× over raster-order models, 3× over recent parallel AR models) while reducing memory consumption by 75% compared to similar-scale methods.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
[31] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders Supplementary Material
Contribution Analysis
Detailed comparisons for each claimed contribution
ARPG: Visual autoregressive model with randomized parallel generation
The authors propose ARPG, a visual autoregressive framework that supports fully random-order training and parallel token generation. Unlike conventional raster-order methods, ARPG eliminates sequential constraints and enables efficient inference while maintaining zero-shot generalization capabilities for tasks like inpainting and outpainting.
[4] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
[29] Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
[39] σ-GPTs: A New Approach to Autoregressive Models
[40] Diffusion-based Large Language Models Survey
[41] Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding
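The random-order training described for this contribution can be illustrated with a short sketch: sample a random permutation of token positions, feed the already-decoded prefix as input, and predict each remaining token at its permuted position. The function and variable names below are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def random_order_batch(tokens, rng):
    """Sketch of random-order training data construction.

    tokens: 1-D array of image-token ids for one sample.
    Returns the sampled generation order, the input prefix tokens,
    the positions to predict, and the ground-truth targets.
    """
    n = len(tokens)
    order = rng.permutation(n)       # a fresh random generation order
    inputs = tokens[order[:-1]]      # tokens seen so far, in that order
    target_pos = order[1:]           # which position to predict next
    targets = tokens[order[1:]]      # ground-truth token at that position
    return order, inputs, target_pos, targets
```

Because the order is resampled per example, the model is never tied to a raster scan, which is what enables zero-shot inpainting and outpainting: at inference the "order" can simply be chosen so that known tokens come first.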
Decoupled decoding framework with positional guidance
The authors introduce a two-pass decoder architecture that separates content representation learning from position-guided prediction. The first pass uses causal self-attention to build content representations as key-value pairs, while the second pass uses position-aware mask tokens as queries that attend to these representations via causal cross-attention.
[38] Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
[32] Design of a Modified Transformer Architecture Based on Relative Position Coding
[33] Architectural entanglement via sequential convergence anchors: A novel framework for latent synchronization in large language models
[34] Disentangled sequential autoencoder
[35] IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction
[36] Generative temporal models with spatial memory for partially observed environments
[37] SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
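The decoupled two-pass attention described for this contribution can be sketched as follows: content tokens are projected once into key-value pairs, and position-aware queries then attend to them. This single-head, unmasked numpy version is a minimal sketch under assumed weight names (`Wk`, `Wv`, `Wq`); the actual model uses causal masking and multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_pass_decode(content, pos_queries, Wk, Wv, Wq):
    """Sketch of the decoupled decoder: pass 1 turns content tokens
    into key-value pairs; pass 2 lets position-aware mask-token
    queries attend to them to predict tokens at target positions.
    Causal masking is omitted here for brevity.
    """
    K = content @ Wk        # (n, d) keys built from content tokens
    V = content @ Wv        # (n, d) values built from content tokens
    Q = pos_queries @ Wq    # (m, d) queries carrying target positions
    d = K.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))   # (m, n) cross-attention
    return attn @ V         # position-guided predictions
```

The key property is that the queries carry only positional information, so "what the model knows" (keys/values) is cleanly separated from "where it should predict next" (queries).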
Parallel inference with shared KV cache
The framework enables efficient parallel decoding by allowing multiple position-aware queries to simultaneously attend to a shared key-value cache. This design achieves significant speedups (30× over raster-order models, 3× over recent parallel AR models) while reducing memory consumption by 75% compared to similar-scale methods.
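The shared-cache mechanism behind this speedup can be sketched in a few lines: at each step, a batch of position queries reads the same cached keys and values, so k tokens are decoded per forward pass while the cache grows once per step rather than once per query. The class and method names below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedKVCache:
    """Minimal sketch of parallel decoding against one shared KV cache."""

    def __init__(self, K, V):
        self.K = K          # (n, d) cached keys from decoded content
        self.V = V          # (n, d) cached values from decoded content

    def decode_step(self, queries, new_K, new_V):
        """Decode several tokens at once: every query in this step
        attends to the same cached keys/values (single unmasked head,
        for brevity)."""
        d = queries.shape[-1]
        attn = softmax(queries @ self.K.T / np.sqrt(d))  # (m, n)
        out = attn @ self.V                              # (m, d)
        # The step's new key-value pairs are appended once and shared
        # by all future queries, so memory grows per step, not per query.
        self.K = np.concatenate([self.K, new_K])
        self.V = np.concatenate([self.V, new_V])
        return out
```

Because the m queries in one step are independent given the cache, they can be batched into a single matrix multiply, which is the source of the reported parallel-decoding speedup.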