Latent Diffusion Model without Variational Autoencoder

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: generative model, deep learning, self-supervised learning
Abstract:

Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with Variational Autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+Diffusion paradigm still suffers from limited training and inference efficiency, along with poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are not only crucial for perception and understanding tasks, but also equally essential for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG—a novel latent diffusion model without variational autoencoders, which unleashes Self-supervised representations for Visual Generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SVG, a latent diffusion model that replaces VAE encoders with frozen DINO features augmented by a lightweight residual branch for detail capture. It resides in the 'Unified Representation-Generation Frameworks' leaf, which contains five papers exploring joint optimization of representation learning and generative modeling. This leaf sits within the broader 'Self-Supervised Representation Learning for Diffusion Models' branch, indicating a moderately active research direction focused on integrating self-supervised objectives directly into diffusion architectures rather than treating encoding and generation as separate stages.

The taxonomy reveals several neighboring approaches. 'Pretrained Encoder Integration' (one paper) and 'Latent Space Stabilization' (one paper) under 'Latent Space Design and Autoencoding' explore similar themes of leveraging discriminative encoders, but focus on stabilization rather than eliminating VAEs entirely. 'Masked Modeling Approaches in Latent Space' (three papers) combines masked reconstruction with latent diffusion, while 'Self-Supervised Pretraining Strategies' (four papers) emphasizes contrastive or generative pretraining before diffusion training. SVG diverges by directly using frozen DINO features as the primary latent space, bypassing both VAE training and masked modeling paradigms.

Across the 30 candidates examined (10 per contribution), each of the three claimed contributions has at least one refutable match. For 'SVG: latent diffusion model without variational autoencoders', 1 of the 10 examined candidates appears to provide overlapping prior work. The same pattern holds for 'Analysis of VAE latent space limitations' and 'Unified feature space for multiple vision tasks', each with 10 candidates examined and 1 refutable match. This suggests that, within the limited search scope, some prior work addresses similar architectural choices or latent space critiques, though the majority of examined papers do not directly overlap.

Given the search examined 30 semantically similar papers rather than an exhaustive corpus, the analysis captures immediate neighbors but may miss distant or less-cited precedents. The taxonomy structure shows this is a moderately populated area with clear sibling work in unified frameworks, yet the specific combination of frozen DINO features and VAE elimination appears less common among the examined candidates. The refutable matches indicate incremental positioning relative to existing encoder-free or encoder-aligned diffusion methods, though the precise degree of novelty depends on details not fully captured by top-K semantic retrieval.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 3

Research Landscape Overview

Core task: latent diffusion models using self-supervised visual representations. The field has evolved around several complementary directions. Self-Supervised Representation Learning for Diffusion Models explores how to train or adapt diffusion architectures using self-supervised objectives, often unifying representation extraction with generative modeling. Diffusion Features for Discriminative Tasks investigates repurposing pretrained diffusion features for downstream recognition, segmentation, or correspondence problems. Domain-Specific Diffusion Applications tailors diffusion pipelines to specialized modalities such as audio, 3D geometry, or medical imaging. Conditional Generation and Control focuses on steering generation via text, layout, or other guidance signals, while Self-Supervised Reconstruction and Restoration addresses tasks like denoising and inpainting without paired supervision. Finally, Latent Space Design and Autoencoding examines how to construct and stabilize the latent representations that diffusion models operate on, including alternatives to traditional VAE bottlenecks.

A particularly active line of work seeks to merge representation learning with diffusion training, reducing reliance on separate autoencoder stages. Diffusion without VAE[0] exemplifies this trend by directly learning latent codes through self-supervised diffusion objectives, closely related to efforts like Self-Supervised DiT[5] and SODA[17], which also integrate representation discovery into the generative process. Meanwhile, works such as Aligning Foundation Encoders[37] and USP[27] explore how pretrained vision encoders can be aligned or adapted for diffusion latent spaces, offering a middle ground between end-to-end training and fixed VAE pipelines.
Compared to these neighbors, Diffusion without VAE[0] emphasizes eliminating the VAE altogether, whereas Aligning Foundation Encoders[37] retains separate encoder modules but seeks better compatibility with diffusion dynamics. This cluster highlights an ongoing tension between architectural simplicity and the flexibility of modular encoder-decoder designs.

Claimed Contributions

SVG: latent diffusion model without variational autoencoders

The authors propose SVG, a new latent diffusion framework that replaces the conventional VAE+Diffusion paradigm by constructing a feature space using frozen DINO features augmented with a lightweight residual branch. This approach enables more efficient diffusion training while preserving semantic discriminability.

10 retrieved papers
Can Refute
Analysis of VAE latent space limitations for diffusion models

The authors systematically analyze mainstream VAE latent spaces and demonstrate that the lack of clear semantic separation and discriminative structure in VAE latents hinders efficient diffusion model training, motivating the need for semantically structured feature spaces.

10 retrieved papers
Can Refute
Unified feature space for multiple vision tasks

The authors demonstrate that SVG constructs a unified feature space that retains the potential to support diverse core vision tasks beyond generation, including perception and understanding, while simultaneously enabling high-quality visual generation.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SVG: latent diffusion model without variational autoencoders

The authors propose SVG, a new latent diffusion framework that replaces the conventional VAE+Diffusion paradigm by constructing a feature space using frozen DINO features augmented with a lightweight residual branch. This approach enables more efficient diffusion training while preserving semantic discriminability.
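The two-branch latent construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frozen "DINO" encoder is replaced by a random frozen stand-in, and all dimensions and module choices are assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class FrozenSemanticEncoder(nn.Module):
    """Stand-in for a frozen DINO backbone: parameters are never updated."""
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        # (B, 3, H, W) -> (B, N, dim) patch tokens
        return self.proj(x).flatten(2).transpose(1, 2)

class ResidualDetailBranch(nn.Module):
    """Lightweight trainable branch capturing fine-grained detail."""
    def __init__(self, dim=64, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)

def build_latent(x, sem_enc, res_branch):
    """Concatenate frozen semantic tokens with residual detail tokens;
    a diffusion model would then be trained directly on this latent."""
    with torch.no_grad():
        z_sem = sem_enc(x)          # semantic, discriminative part
    z_res = res_branch(x)           # trainable detail part
    return torch.cat([z_sem, z_res], dim=-1)  # (B, N, dim_sem + dim_res)

sem_enc = FrozenSemanticEncoder()
res_branch = ResidualDetailBranch()
x = torch.randn(2, 3, 224, 224)
z = build_latent(x, sem_enc, res_branch)
print(z.shape)  # torch.Size([2, 196, 832])
```

The key design point the sketch captures is that gradients from the diffusion objective would flow only through the small residual branch, leaving the semantic structure of the frozen features intact.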

Contribution

Analysis of VAE latent space limitations for diffusion models

The authors systematically analyze mainstream VAE latent spaces and demonstrate that the lack of clear semantic separation and discriminative structure in VAE latents hinders efficient diffusion model training, motivating the need for semantically structured feature spaces.
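One simple way to quantify the "semantic separation" this analysis refers to is a Fisher-style ratio of between-class to within-class variance over latents. The sketch below is illustrative and not taken from the paper: it uses synthetic latents, with one unstructured blob standing in for VAE-like latents and class-clustered points standing in for DINO-like latents.

```python
import numpy as np

def fisher_ratio(z, labels):
    """Between-class variance / within-class variance of latents z."""
    mu = z.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        zc = z[labels == c]
        between += len(zc) * np.sum((zc.mean(axis=0) - mu) ** 2)
        within += np.sum((zc - zc.mean(axis=0)) ** 2)
    return between / within

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 100)

# "VAE-like" latents: all classes share one blob (little semantic separation).
z_vae = rng.normal(size=(300, 16))

# "DINO-like" latents: each class forms its own cluster.
centers = rng.normal(scale=4.0, size=(3, 16))
z_dino = centers[labels] + rng.normal(size=(300, 16))

print(fisher_ratio(z_vae, labels) < fisher_ratio(z_dino, labels))  # True
```

Under this proxy, latents with clear class structure score much higher, matching the paper's claim that discriminative structure in the latent space is what VAE latents lack.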

Contribution

Unified feature space for multiple vision tasks

The authors demonstrate that SVG constructs a unified feature space that retains the potential to support diverse core vision tasks beyond generation, including perception and understanding, while simultaneously enabling high-quality visual generation.