Eliminating VAE for Fast and High-Resolution Generative Detail Restoration

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Diffusion, Super-Resolution, Adversarial distillation, Model Compression
Abstract:

Diffusion models have attained remarkable breakthroughs in the real-world super-resolution (SR) task, albeit at slow inference and high demand on devices. To accelerate inference, recent works like GenDR adopt step distillation to minimize the step number to one. However, the memory boundary still restricts the maximum processing size, necessitating tile-by-tile restoration of high-resolution images. Through profiling the pipeline, we pinpoint that the variational auto-encoder (VAE) is the bottleneck of latency and memory. To completely solve the problem, we leverage pixel-(un)shuffle operations to eliminate the VAE, reversing the latent-based GenDR to pixel-space GenDR-Pix. However, upscale with ×8 pixelshuffle may induce artifacts of repeated patterns. To alleviate the distortion, we propose a multi-stage adversarial distillation to progressively remove the encoder and decoder. Specifically, we utilize generative features from the previous stage models to guide adversarial discrimination. Moreover, we propose random padding to augment generative features and avoid discriminator collapse. We also introduce a masked Fourier space loss to penalize the outliers of amplitude. To improve inference performance, we empirically integrate a padding-based self-ensemble with classifier-free guidance to improve inference scaling. Experimental results show that GenDR-Pix performs 2.8× acceleration and 60% memory-saving compared to GenDR with negligible visual degradation, surpassing other one-step diffusion SR. Against all odds, GenDR-Pix can restore 4K image in only 1 second and 6 GB.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes eliminating the variational autoencoder from diffusion-based super-resolution pipelines to address latency and memory bottlenecks, introducing a pixel-space variant called GenDR-Pix. It resides in the 'One-Step and Few-Step Diffusion Models' leaf, which contains nine papers focused on achieving super-resolution in minimal diffusion steps via distillation or direct training. This is a moderately populated research direction within the broader taxonomy of 46 papers, indicating active interest in step reduction strategies but not extreme saturation.

The taxonomy reveals neighboring approaches in 'Residual and Latent Space Diffusion Acceleration' (five papers) and 'Adaptive and Region-Aware Acceleration' (four papers), both exploring alternative efficiency pathways. While siblings like SinSR and AddSR retain VAE components and focus on distillation or single-image training, this work diverges by removing the encoder-decoder entirely and operating in pixel space. The 'Diffusion-GAN Hybrid Models' branch (four papers) shares adversarial training elements but integrates GANs differently, whereas this paper uses adversarial distillation specifically to progressively eliminate VAE stages.

Of the 20 candidates examined, 10 were compared against the multi-stage adversarial distillation contribution, and one of them can refute it, suggesting some prior overlap in progressive distillation techniques. The other 10 were compared against the masked Fourier space loss contribution, and none can refute it, indicating relative novelty in frequency-domain artifact mitigation within this limited search scope. No candidates were retrieved for the padding-based classifier-free guidance contribution, so it was not examined. These statistics reflect a focused semantic search rather than exhaustive coverage, leaving open the possibility of additional relevant work beyond the top-20 matches.

Based on the limited search scope of 20 semantically similar papers, the work appears to occupy a distinct position by targeting VAE elimination rather than VAE optimization. The taxonomy structure suggests this direction is less explored than distillation-based acceleration, though the single refutable candidate for adversarial distillation indicates some methodological overlap. The analysis does not cover broader diffusion acceleration literature outside the top-20 semantic neighborhood or recent preprints that may address similar bottlenecks.

Taxonomy

Core-task Taxonomy Papers: 46
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Paper: 1

Research Landscape Overview

Core task: Accelerating diffusion-based image super-resolution by eliminating variational autoencoders.

The field of diffusion-based super-resolution has evolved into several distinct branches addressing the fundamental tension between sample quality and computational cost. Sampling Step Reduction and Distillation Methods focus on compressing multi-step diffusion processes into one-step or few-step alternatives through distillation techniques and consistency models, exemplified by works like SinSR[3] and AddSR[8]. Hybrid Generative Architectures blend diffusion models with GANs or other generative frameworks to leverage complementary strengths, as seen in Denoising Diffusion GANS[1] and Hybrid Conditional Diffusion[2]. Architectural and Computational Optimization pursues efficiency through model compression, quantization, and lightweight designs such as TinySR[13] and Adversarial Diffusion Compression[12]. Domain-Specific and Conditional Diffusion Methods tailor diffusion processes to particular imaging modalities or conditioning signals, while Real-World and Blind Super-Resolution tackles degradation uncertainty in practical scenarios using approaches like ResShift[4].

A particularly active research direction involves reducing sampling overhead without sacrificing perceptual quality, where methods explore distillation, flow matching, and residual shifting strategies. Eliminating VAE[0] sits squarely within the one-step and few-step diffusion cluster, proposing to bypass the variational autoencoder bottleneck that typically adds latency and complexity to latent diffusion pipelines. This approach contrasts with neighbors like Consistency Rectified Flow[33] and One-Step Residual Shifting[35], which retain VAE components but accelerate sampling through alternative formulations of the denoising trajectory. Compared to SinSR[3], which emphasizes single-image training regimes, Eliminating VAE[0] focuses on architectural streamlining for faster inference across diverse inputs. The central question remains whether removing the VAE entirely can preserve the distributional advantages of latent-space diffusion while achieving competitive speed-quality trade-offs relative to hybrid or distilled alternatives.

Claimed Contributions

Multi-stage adversarial distillation for VAE elimination

The authors propose a two-stage training procedure that gradually replaces the VAE encoder and decoder with pixel-unshuffle and pixel-shuffle operations. Stage I removes the encoder using latent matching and adversarial learning; Stage II removes the decoder using the Stage I model as discriminator, incorporating random padding augmentation and masked Fourier space loss to prevent artifacts.
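
As a rough illustration of the two operations named above, the NumPy sketch below shows how pixel-unshuffle folds each 8×8 patch into channels and how pixel-shuffle inverts it, standing in for the spatial down- and upsampling that the encoder and decoder would otherwise perform. The function names and the channel ordering (chosen to follow the common PyTorch convention) are assumptions made for illustration; the sketch does not reproduce the authors' network or either training stage.

```python
import numpy as np

def pixel_unshuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Fold each r x r spatial patch into channels: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "spatial size must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)   # split rows/cols into (block, offset)
    x = x.transpose(0, 2, 4, 1, 3)           # -> (c, row_off, col_off, h/r, w/r)
    return x.reshape(c * r * r, h // r, w // r)

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Inverse of pixel_unshuffle: (C*r*r, H, W) -> (C, H*r, W*r)."""
    crr, h, w = x.shape
    assert crr % (r * r) == 0, "channel count must be divisible by r*r"
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w)             # -> (c, row_off, col_off, h, w)
    x = x.transpose(0, 3, 1, 4, 2)           # -> (c, h, row_off, w, col_off)
    return x.reshape(c, h * r, w * r)

# Hypothetical RGB input; sizes are arbitrary multiples of 8.
image = np.arange(3 * 16 * 24, dtype=np.float32).reshape(3, 16, 24)
folded = pixel_unshuffle(image, 8)           # shape (192, 2, 3): 3 * 8 * 8 channels
restored = pixel_shuffle(folded, 8)          # lossless round trip back to (3, 16, 24)
```

Both operations are pure index rearrangements with no learned parameters, which is consistent with the report's framing of them as a cheap replacement for the VAE. Because each output pixel in an 8×8 block comes from a separate channel, a network upstream of the ×8 shuffle can produce block-periodic patterns, which is the artifact the abstract warns about.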

10 retrieved papers, Can Refute

Masked Fourier space loss for artifact mitigation

A frequency-domain loss function is introduced to suppress periodic artifacts caused by large-scale pixel-shuffle operations. The loss applies a band-rejection filter in the Fourier domain to penalize anomalous spike amplitudes that correspond to repeated pattern artifacts.
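
To make the idea concrete, here is a minimal NumPy sketch of a Fourier-amplitude penalty restricted to the frequency bins where a pattern repeating every 8 pixels concentrates its energy. The report does not give the authors' exact mask or loss, so everything specific here is an assumption: the names `spike_mask` and `masked_fourier_loss`, the choice of bins (multiples of H/8 and W/8, excluding the DC term), and the L1 distance on amplitudes.

```python
import numpy as np

def spike_mask(h: int, w: int, period: int = 8) -> np.ndarray:
    """Boolean mask over FFT bins; True where a `period`-pixel repeat puts its energy.

    A signal of length N that repeats every P samples (P divides N) has non-zero
    DFT coefficients only at multiples of N / P.
    """
    assert h % period == 0 and w % period == 0, "size must be divisible by period"
    rows = np.arange(h) % (h // period) == 0
    cols = np.arange(w) % (w // period) == 0
    mask = rows[:, None] & cols[None, :]
    mask[0, 0] = False  # skip the DC term, which only encodes mean brightness
    return mask

def masked_fourier_loss(pred: np.ndarray, target: np.ndarray, period: int = 8) -> float:
    """Mean absolute amplitude difference on the masked bins (assumed loss form)."""
    h, w = pred.shape
    amp_pred = np.abs(np.fft.fft2(pred)) / (h * w)
    amp_target = np.abs(np.fft.fft2(target)) / (h * w)
    mask = spike_mask(h, w, period)
    return float(np.abs(amp_pred - amp_target)[mask].mean())

# Hypothetical single-channel data: a clean target and a copy with an 8x8 tiled artifact.
rng = np.random.default_rng(0)
target = rng.random((64, 64))
tile = rng.random((8, 8))
with_artifact = target + 0.1 * np.tile(tile, (8, 8))

clean_loss = masked_fourier_loss(target, target)
artifact_loss = masked_fourier_loss(with_artifact, target)
```

Restricting the penalty to a small set of bins leaves the rest of the spectrum to the other training losses, so the term targets the repeated-pattern spikes without pushing the output toward an overly smooth result.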

10 retrieved papers

Padding-based classifier-free guidance (PadCFG)

An inference-time strategy that empirically integrates self-ensemble and classifier-free guidance by using different padding configurations for positive and negative conditions. This approach reduces artifacts while maintaining computational efficiency compared to full self-ensemble methods.
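
The combination described above can be sketched as two forward passes that differ only in padding, merged with the standard classifier-free guidance rule `neg + w * (pos - neg)`. The report does not specify the padding offsets, the padding mode, or the guidance weight, so the values below, the helper names `run_padded` and `pad_cfg`, and the toy nearest-neighbour "model" are all hypothetical.

```python
import numpy as np
from typing import Callable

Model = Callable[[np.ndarray], np.ndarray]

def run_padded(model: Model, lr: np.ndarray, pad: int, scale: int) -> np.ndarray:
    """Reflect-pad the top/left by `pad` pixels, run the model, crop the padding off."""
    padded = np.pad(lr, ((pad, 0), (pad, 0)), mode="reflect")
    out = model(padded)
    return out[pad * scale:, pad * scale:]

def pad_cfg(model: Model, lr: np.ndarray, scale: int,
            pos_pad: int = 0, neg_pad: int = 4, guidance: float = 1.5) -> np.ndarray:
    """Merge two differently padded passes with the usual CFG extrapolation."""
    pos = run_padded(model, lr, pos_pad, scale)   # "positive" branch
    neg = run_padded(model, lr, neg_pad, scale)   # "negative" branch
    return neg + guidance * (pos - neg)

def toy_upsampler(x: np.ndarray) -> np.ndarray:
    """Stand-in x2 model (nearest neighbour); NOT the restoration network."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

low_res = np.arange(16 * 16, dtype=np.float64).reshape(16, 16)
result = pad_cfg(toy_upsampler, low_res, scale=2)
```

With a perfectly shift-equivariant model such as the toy upsampler, both branches agree and the guidance term vanishes. The two passes only differ when the model's output depends on how the content aligns with its internal grid, which is the situation the report associates with pixel-shuffle artifacts. Two passes also cost less than a full self-ensemble over many shifts or flips, matching the efficiency claim.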

0 retrieved papers
