GenDR: Lighten Generative Detail Restoration

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Diffusion, Super-Resolution, Score distillation
Abstract:

Although recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable progress, the misalignment of their objectives leads to a suboptimal trade-off between inference speed and detail fidelity. Specifically, the T2I task requires multiple inference steps to synthesize images matching the prompts, and reduces the latent dimension to lower the difficulty of generation. In contrast, SR can restore high-frequency details in fewer inference steps, but it requires a more reliable variational auto-encoder (VAE) to preserve input information. However, most diffusion-based SR models are multi-step and use 4-channel VAEs, while the existing models with 16-channel VAEs are oversized diffusion transformers, e.g., FLUX (12B). To align the targets, we present GenDR, a one-step diffusion model for generative detail restoration, distilled from a tailored diffusion model with a larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand the latent space without increasing the model size. For step distillation, we propose consistent score identity distillation (CiD), which incorporates an SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also streamline the pipeline for more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GenDR, a one-step diffusion model for generative detail restoration in super-resolution, combining a novel 16-channel VAE (SD2.1-VAE16) with consistent score identity distillation (CiD). It resides in the One-Step Diffusion Models leaf, which contains eight papers total, indicating a moderately populated research direction within the broader Inference Acceleration and Efficiency branch. This positioning reflects the field's active pursuit of minimal-latency diffusion methods that compress multi-step sampling into single forward passes while preserving perceptual quality.

The taxonomy reveals neighboring leaves addressing related acceleration challenges: Few-Step Diffusion Models explores partial trajectory compression, Adaptive and Dynamic Acceleration applies content-aware speedups, and Lightweight Architectures reduces parameter counts. GenDR's approach diverges by targeting one-step inference through distillation rather than adaptive sampling or architectural pruning. Its use of a larger latent space (16-channel VAE) also connects to Fidelity and Structure Preservation concerns, as expanding latent dimensionality aims to retain input information that standard 4-channel VAEs might discard during aggressive step reduction.

Among the 21 candidates examined, the SD2.1-VAE16 contribution shows one refutable candidate out of the one examined, suggesting prior work on expanded VAE architectures exists within the limited search scope. The CiD distillation method was compared against ten candidates with one refutable match, indicating some overlap in task-specific distillation strategies but leaving nine cases non-refutable or unclear. The CiDA extension (CiD with adversarial learning) was compared against ten candidates with zero refutations, appearing more novel within this search window. These statistics reflect a focused semantic search, not exhaustive coverage of all distillation or VAE literature.

Based on the top-21 semantic matches examined, the work appears to introduce meaningful technical variations—particularly the 16-channel VAE and adversarial-augmented distillation—though the limited scope means potentially relevant prior work in broader diffusion or VAE research may remain unexamined. The analysis captures the paper's position within a moderately active one-step acceleration subfield but cannot definitively assess novelty against the entire diffusion super-resolution landscape.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 2

Research Landscape Overview

Core task: real-world image super-resolution with diffusion models. The field has evolved into several distinct branches addressing complementary challenges. Inference Acceleration and Efficiency focuses on reducing the computational burden of iterative diffusion sampling, with methods ranging from one-step distillation approaches like GenDR[0] and One-step Effective[19] to flow-based acceleration techniques such as Flow Trajectory Distillation[28]. Fidelity and Structure Preservation emphasizes maintaining faithful reconstruction of input details, while Perceptual Quality Enhancement targets visually pleasing outputs that may trade pixel-level accuracy for realism. Semantic and Content Awareness incorporates high-level understanding—for instance, Text Prompt Diffusion[10] and Scene Text Diffusion[3] leverage textual or semantic cues to guide restoration. Domain-Specific Applications tailors diffusion models to specialized settings like medical imaging, remote sensing, or video upscaling, whereas Uncertainty and Stochasticity Management explores controlling the inherent randomness in generative processes. Finally, Specialized Degradation Handling addresses complex real-world corruptions beyond simple downsampling, including blur, noise, and compression artifacts.

A central tension across these branches is the trade-off between speed and quality: many studies pursue efficient one-step or few-step inference to make diffusion practical, yet risk sacrificing the rich detail that multi-step sampling provides. Within Inference Acceleration, GenDR[0] exemplifies the one-step paradigm by distilling a diffusion prior into a single forward pass, positioning itself alongside works like Visual Perception Distillation[11] and SinSR[12] that similarly compress iterative refinement.
Compared to Large-Scale Discriminator[27], which may still rely on adversarial training for realism, or HF-Diff[46], which balances frequency-domain constraints with diffusion steps, GenDR[0] prioritizes minimal latency while aiming to preserve perceptual fidelity. Meanwhile, neighboring efforts such as Transfer VAE[5] and TSD-SR[35] explore alternative one-step architectures or hybrid strategies, highlighting ongoing questions about how best to retain semantic coherence and fine texture when collapsing the diffusion trajectory into a single inference stage.

Claimed Contributions

SD2.1-VAE16: 16-channel VAE for super-resolution

The authors develop SD2.1-VAE16, a diffusion model with a 16-channel variational autoencoder instead of the standard 4-channel VAE. This larger latent space is designed to preserve more details for super-resolution tasks while maintaining computational efficiency through representation alignment training.

1 retrieved paper (Can Refute)
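As a back-of-the-envelope illustration (not the authors' implementation) of why a 16-channel latent preserves more input information: both VAE families downsample spatially by 8x, so at the same resolution the 16-channel latent carries four times as many values, i.e., a much milder compression of the input signal.

```python
def latent_shape(h, w, channels, downsample=8):
    """Latent spatial size after the VAE's 8x downsampling, plus channel count."""
    return (h // downsample, w // downsample, channels)

def latent_numel(h, w, channels, downsample=8):
    lh, lw, c = latent_shape(h, w, channels, downsample)
    return lh * lw * c

# Standard 4-channel SD-style VAE vs. the 16-channel variant.
n4 = latent_numel(512, 512, 4)    # 64 * 64 * 4  = 16384
n16 = latent_numel(512, 512, 16)  # 64 * 64 * 16 = 65536
pixels = 512 * 512 * 3            # 786432 input values

print(n16 // n4)     # 4  (4x more latent capacity at the same resolution)
print(pixels // n4)  # 48 (compression ratio of the 4-channel latent)
print(pixels // n16) # 12 (compression ratio of the 16-channel latent)
```

The milder 12x compression is what lets the SR model recover fine textures that a 48x-compressed latent would have already discarded at encoding time.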
Consistent score identity distillation (CiD)

The authors introduce CiD, a step distillation method that integrates super-resolution task-specific losses into score identity distillation. This approach addresses the misalignment between text-to-image and super-resolution objectives by incorporating SR priors and ensuring consistency between training distributions.

10 retrieved papers (Can Refute)
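The combination described above can be sketched as a toy objective, assuming a squared-error surrogate for the score-distillation term and an L1 SR fidelity term; the paper's exact estimator and weighting are not reproduced here, so `lam` is a placeholder:

```python
import numpy as np

def sid_loss(student_score, teacher_score):
    # Score-distillation term: pull the student's predicted score toward the
    # frozen teacher's (squared-error surrogate, not the paper's exact estimator).
    return float(np.mean((student_score - teacher_score) ** 2))

def sr_task_loss(sr_output, hr_target):
    # SR task-specific fidelity term: L1 distance to the ground-truth HR image.
    return float(np.mean(np.abs(sr_output - hr_target)))

def cid_loss(student_score, teacher_score, sr_output, hr_target, lam=1.0):
    # CiD as described: score distillation plus an SR loss, so the distilled
    # one-step student is also trained against the restoration objective.
    return sid_loss(student_score, teacher_score) + lam * sr_task_loss(sr_output, hr_target)
```

With a perfect student (scores matching the teacher, output equal to the HR target) the loss is zero; any residual in either term raises it, which is the sense in which the distillation target and the SR target are jointly aligned.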
CiDA: CiD with adversarial learning and representation alignment

The authors extend CiD by incorporating adversarial learning and representation alignment into the distillation framework. This extension, called CiDA, improves perceptual quality of restored images and speeds up the training process while maintaining detail fidelity.

10 retrieved papers
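The extension can likewise be sketched as adding two terms to the CiD objective. The non-saturating GAN loss, the cosine-based alignment, and the weights `lam_adv` and `lam_repa` below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def adv_loss(disc_logits_fake):
    # Non-saturating generator loss: softplus(-logits), low when the
    # discriminator scores restored images as "real" (illustrative choice).
    return float(np.mean(np.logaddexp(0.0, -disc_logits_fake)))

def repa_loss(student_feat, encoder_feat):
    # Representation alignment: cosine distance between the student's internal
    # features and a pretrained visual encoder's features (assumed formulation).
    a = student_feat / np.linalg.norm(student_feat)
    b = encoder_feat / np.linalg.norm(encoder_feat)
    return 1.0 - float(np.dot(a, b))

def cida_loss(cid_term, disc_logits_fake, student_feat, encoder_feat,
              lam_adv=0.1, lam_repa=0.5):
    # CiDA = CiD + adversarial realism term + representation alignment term;
    # the weights here are placeholders, not values from the paper.
    return (cid_term
            + lam_adv * adv_loss(disc_logits_fake)
            + lam_repa * repa_loss(student_feat, encoder_feat))
```

The adversarial term pushes outputs toward the natural-image manifold for perceptual quality, while the alignment term gives the student a strong pretrained feature target, which is the stated mechanism for faster training convergence.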

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

SD2.1-VAE16: 16-channel VAE for super-resolution
Consistent score identity distillation (CiD)
CiDA: CiD with adversarial learning and representation alignment