Latent Diffusion Model without Variational Autoencoder

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: generative model, deep learning, self-supervised learning
Abstract:

Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with Variational Autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+Diffusion paradigm still suffers from limited training and inference efficiency, along with poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are not only crucial for perception and understanding tasks, but also equally essential for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG—a novel latent diffusion model without variational autoencoders, which unleashes Self-supervised representations for Visual Generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SVG, a latent diffusion model that replaces VAE encoders with frozen DINO features augmented by a lightweight residual branch for detail capture. It resides in the 'Unified Representation-Generation Frameworks' leaf, which contains five papers exploring joint optimization of representation learning and generative modeling. This leaf sits within the broader 'Self-Supervised Representation Learning for Diffusion Models' branch, indicating a moderately active research direction focused on integrating self-supervised objectives directly into diffusion architectures rather than treating encoding and generation as separate stages.

The taxonomy reveals several neighboring approaches. 'Pretrained Encoder Integration' (one paper) and 'Latent Space Stabilization' (one paper) under 'Latent Space Design and Autoencoding' explore similar themes of leveraging discriminative encoders, but focus on stabilization rather than eliminating VAEs entirely. 'Masked Modeling Approaches in Latent Space' (three papers) combines masked reconstruction with latent diffusion, while 'Self-Supervised Pretraining Strategies' (four papers) emphasizes contrastive or generative pretraining before diffusion training. SVG diverges by directly using frozen DINO features as the primary latent space, bypassing both VAE training and masked modeling paradigms.

Across the 30 candidates examined (10 per contribution), each of the three claimed contributions has at least one refutable match. For 'SVG: latent diffusion model without variational autoencoders', 1 of the 10 examined candidates appears to provide overlapping prior work. The same pattern holds for 'Analysis of VAE latent space limitations' and 'Unified feature space for multiple vision tasks', each with 10 candidates examined and 1 refutable match. This suggests that, within the limited search scope, some prior work addresses similar architectural choices or latent space critiques, though the majority of examined papers do not directly overlap.

Given the search examined 30 semantically similar papers rather than an exhaustive corpus, the analysis captures immediate neighbors but may miss distant or less-cited precedents. The taxonomy structure shows this is a moderately populated area with clear sibling work in unified frameworks, yet the specific combination of frozen DINO features and VAE elimination appears less common among the examined candidates. The refutable matches indicate incremental positioning relative to existing encoder-free or encoder-aligned diffusion methods, though the precise degree of novelty depends on details not fully captured by top-K semantic retrieval.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 3

Research Landscape Overview

Core task: latent diffusion models using self-supervised visual representations. The field has evolved around several complementary directions. Self-Supervised Representation Learning for Diffusion Models explores how to train or adapt diffusion architectures using self-supervised objectives, often unifying representation extraction with generative modeling. Diffusion Features for Discriminative Tasks investigates repurposing pretrained diffusion features for downstream recognition, segmentation, or correspondence problems. Domain-Specific Diffusion Applications tailors diffusion pipelines to specialized modalities such as audio, 3D geometry, or medical imaging. Conditional Generation and Control focuses on steering generation via text, layout, or other guidance signals, while Self-Supervised Reconstruction and Restoration addresses tasks like denoising and inpainting without paired supervision. Finally, Latent Space Design and Autoencoding examines how to construct and stabilize the latent representations that diffusion models operate on, including alternatives to traditional VAE bottlenecks.

A particularly active line of work seeks to merge representation learning with diffusion training, reducing reliance on separate autoencoder stages. Diffusion without VAE[0] exemplifies this trend by directly learning latent codes through self-supervised diffusion objectives, closely related to efforts like Self-Supervised DiT[5] and SODA[17], which also integrate representation discovery into the generative process. Meanwhile, works such as Aligning Foundation Encoders[37] and USP[27] explore how pretrained vision encoders can be aligned or adapted for diffusion latent spaces, offering a middle ground between end-to-end training and fixed VAE pipelines.
Compared to these neighbors, Diffusion without VAE[0] emphasizes eliminating the VAE altogether, whereas Aligning Foundation Encoders[37] retains separate encoder modules but seeks better compatibility with diffusion dynamics. This cluster highlights an ongoing tension between architectural simplicity and the flexibility of modular encoder-decoder designs.

Claimed Contributions

SVG: latent diffusion model without variational autoencoders

The authors propose SVG, a new latent diffusion framework that replaces the conventional VAE+Diffusion paradigm by constructing a feature space using frozen DINO features augmented with a lightweight residual branch. This approach enables more efficient diffusion training while preserving semantic discriminability.

10 retrieved papers
Can Refute
Analysis of VAE latent space limitations for diffusion models

The authors systematically analyze mainstream VAE latent spaces and demonstrate that the lack of clear semantic separation and discriminative structure in VAE latents hinders efficient diffusion model training, motivating the need for semantically structured feature spaces.

10 retrieved papers
Can Refute
Unified feature space for multiple vision tasks

The authors demonstrate that SVG constructs a unified feature space that retains the potential to support diverse core vision tasks beyond generation, including perception and understanding, while simultaneously enabling high-quality visual generation.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SVG: latent diffusion model without variational autoencoders

The authors propose SVG, a new latent diffusion framework that replaces the conventional VAE+Diffusion paradigm by constructing a feature space using frozen DINO features augmented with a lightweight residual branch. This approach enables more efficient diffusion training while preserving semantic discriminability.
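The two-branch latent construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frozen "DINO" encoder is replaced by a random frozen stand-in, and all dimensions and module choices are assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class FrozenSemanticEncoder(nn.Module):
    """Stand-in for a frozen DINO backbone: parameters are never updated."""
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        # (B, 3, H, W) -> (B, N, dim) patch tokens
        return self.proj(x).flatten(2).transpose(1, 2)

class ResidualDetailBranch(nn.Module):
    """Lightweight trainable branch capturing fine-grained detail."""
    def __init__(self, dim=64, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)

def build_latent(x, sem_enc, res_branch):
    """Concatenate frozen semantic tokens with residual detail tokens;
    a diffusion model would then be trained directly on this latent."""
    with torch.no_grad():
        z_sem = sem_enc(x)          # semantic, discriminative part
    z_res = res_branch(x)           # trainable detail part
    return torch.cat([z_sem, z_res], dim=-1)  # (B, N, dim_sem + dim_res)

sem_enc = FrozenSemanticEncoder()
res_branch = ResidualDetailBranch()
x = torch.randn(2, 3, 224, 224)
z = build_latent(x, sem_enc, res_branch)
print(z.shape)  # torch.Size([2, 196, 832])
```

The key design point the sketch captures is that gradients from the diffusion objective would flow only through the small residual branch, leaving the semantic structure of the frozen features intact.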

Contribution

Analysis of VAE latent space limitations for diffusion models

The authors systematically analyze mainstream VAE latent spaces and demonstrate that the lack of clear semantic separation and discriminative structure in VAE latents hinders efficient diffusion model training, motivating the need for semantically structured feature spaces.
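One simple way to quantify the "semantic separation" this analysis refers to is a Fisher-style ratio of between-class to within-class variance over latents. The sketch below is illustrative and not taken from the paper: it uses synthetic latents, with one unstructured blob standing in for VAE-like latents and class-clustered points standing in for DINO-like latents.

```python
import numpy as np

def fisher_ratio(z, labels):
    """Between-class variance / within-class variance of latents z."""
    mu = z.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        zc = z[labels == c]
        between += len(zc) * np.sum((zc.mean(axis=0) - mu) ** 2)
        within += np.sum((zc - zc.mean(axis=0)) ** 2)
    return between / within

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 100)

# "VAE-like" latents: all classes share one blob (little semantic separation).
z_vae = rng.normal(size=(300, 16))

# "DINO-like" latents: each class forms its own cluster.
centers = rng.normal(scale=4.0, size=(3, 16))
z_dino = centers[labels] + rng.normal(size=(300, 16))

print(fisher_ratio(z_vae, labels) < fisher_ratio(z_dino, labels))  # True
```

Under this proxy, latents with clear class structure score much higher, matching the paper's claim that discriminative structure in the latent space is what VAE latents lack.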

Contribution

Unified feature space for multiple vision tasks

The authors demonstrate that SVG constructs a unified feature space that retains the potential to support diverse core vision tasks beyond generation, including perception and understanding, while simultaneously enabling high-quality visual generation.