Diffusion Transformers with Representation Autoencoders

ICLR 2026 Conference Submission
Anonymous Authors
Generative Models · Diffusion Models · Representation Learning · High-dimensional Diffusion
Abstract:

Latent generative modeling has become the standard strategy for Diffusion Transformers (DiTs), but the autoencoder has barely evolved. Most DiTs still rely on the legacy VAE encoder, which introduces several limitations: a large convolutional backbone that compromises architectural simplicity, a low-dimensional latent space that restricts information capacity, and weak representations resulting from purely reconstruction-based training. In this work, we investigate replacing the VAE encoder–decoder with pretrained representation encoders (e.g., DINO, SigLIP, MAE) combined with trained decoders, forming what we call Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. A key challenge is enabling diffusion transformers to operate effectively in these high-dimensional representation spaces. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation-alignment losses. Using a DiT variant with a lightweight, wide DDT head, we demonstrate state-of-the-art image generation, reaching FIDs of 1.18 at 256×256 and 1.13 at 512×512 resolution on ImageNet.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Representation Autoencoders (RAEs), which replace traditional VAE encoders with pretrained representation encoders (DINO, SigLIP, MAE) paired with trained decoders. This work resides in the 'Representation Autoencoders' leaf of the taxonomy, which contains only two papers including the original. This sparse population suggests the specific approach of combining pretrained encoders with diffusion transformers is relatively unexplored. The taxonomy shows the broader 'Autoencoder Architecture and Latent Space Design' branch contains five leaves addressing compression, latent properties, masking, unified architectures, and representation autoencoders, indicating moderate activity in autoencoder design overall.

The taxonomy reveals neighboring research directions that contextualize this work. The 'High-Compression and Efficient Autoencoders' leaf (three papers) pursues spatial compression through architectural innovations, while 'Latent Space Properties and Optimization' (four papers) analyzes latent characteristics like smoothness and discriminability. The 'Masked Autoencoder Integration' leaf (one paper) explores masking strategies, and 'Unified End-to-End Architectures' (two papers) merges encoder-decoder-diffusion components. The original paper diverges by emphasizing semantically rich pretrained representations rather than compression ratios or end-to-end unification, carving a distinct niche within the autoencoder design landscape.

Among 21 candidates examined across three contributions, none were found to clearly refute the proposed methods. Contribution A (RAEs) examined 10 candidates with zero refutable matches, suggesting limited direct prior work on this specific encoder-decoder combination. Contribution B (theoretically motivated solutions for high-dimensional diffusion) also examined 10 candidates with no refutations, indicating the theoretical analysis may address gaps in existing literature. Contribution C (DiTDH architecture) examined only one candidate with no overlap. The modest search scope (21 papers) and absence of refutations suggest these contributions occupy relatively novel territory within the examined literature.

Based on the limited search scope of 21 semantically related papers, the work appears to introduce a distinct approach within a sparsely populated research direction. The taxonomy structure confirms that representation autoencoders constitute a small but emerging area, with the original paper and one sibling defining this leaf. However, the analysis does not cover exhaustive literature review or broader architectural surveys, leaving open the possibility of related work outside the top-K semantic matches examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: latent diffusion modeling with representation autoencoders. This field centers on learning compact latent representations that enable efficient diffusion-based generation across diverse data modalities. The taxonomy reveals four main branches that collectively map the landscape.

Autoencoder Architecture and Latent Space Design focuses on the structural choices underlying representation autoencoders—ranging from hierarchical designs like Hierarchical Diffusion Autoencoders[20] to specialized compression schemes such as Deep Compression Autoencoder[13] and adaptive encoding strategies like Adaptive Latent Encoding[29]. Representation Learning and Semantic Encoding emphasizes how autoencoders capture meaningful structure, with works like Diffusion Representation Learner[3] and Diffusion Representation Learning[5] exploring semantic disentanglement and interpretability. Generation Applications Across Domains showcases the breadth of modalities tackled—from language (Latent Diffusion Language[4]) and proteins (Latent Diffusion Protein[6], ProteinAE Diffusion[37]) to DNA sequences (Latent Diffusion DNA[10]) and 3D geometry (Geometric Latent Diffusion[19]). Specialized Applications and Analysis addresses niche use cases and analytical perspectives, including medical imaging (Latent Diffusion Medical[12]) and anomaly detection (Lafite Anomaly Detection[24]).

A particularly active line of work explores the interplay between autoencoder design and diffusion quality, with studies like Improving Diffusability Autoencoders[8] and Diffusion Bridge AutoEncoders[15] investigating how latent space properties affect generative performance. Another contrasting theme is the tension between compression efficiency and semantic fidelity: while Deep Compression Autoencoder[13] and LiteVAE[30] prioritize compact representations, works like Structured Latent Space[18] and Lost Latent Space[33] examine the trade-offs in preserving interpretable structure.
The original paper, Diffusion Transformers Autoencoders[0], sits within the Autoencoder Architecture branch alongside MeanFlow Transformers Autoencoders[14], emphasizing transformer-based architectures for representation learning. Compared to neighboring efforts like Diffusion Masked Autoencoders[2], which integrate masking strategies, Diffusion Transformers Autoencoders[0] appears to focus more directly on leveraging transformer expressiveness to refine latent encodings for diffusion, positioning it at the intersection of architectural innovation and representation quality.

Claimed Contributions

Representation Autoencoders (RAEs)

The authors propose Representation Autoencoders (RAEs), which replace traditional VAE encoders with frozen pretrained representation encoders (such as DINOv2, SigLIP, or MAE) paired with lightweight learned decoders. RAEs provide high-quality reconstructions and semantically rich latent spaces while using a scalable transformer-based architecture.
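To make the frozen-encoder/trained-decoder split concrete, the structure can be sketched with toy linear stand-ins: the "pretrained encoder" below is a fixed random projection playing the role of a frozen DINOv2/SigLIP/MAE backbone, and only the decoder is fit. All dimensions and names here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen pretrained encoder (e.g. DINOv2): a fixed
# linear map from a 64-dim "pixel" space to a 768-dim latent space.
# In an RAE the encoder is a real pretrained ViT and is never updated.
W_enc = rng.standard_normal((768, 64)) / np.sqrt(64)

def encode(x):
    # Frozen: no parameters of W_enc are ever trained.
    return x @ W_enc.T

# Only the decoder is learned. With a linear decoder and squared error,
# the optimum is simply the least-squares solution of Z @ W_dec ≈ X.
X = rng.standard_normal((1000, 64))            # toy "images"
Z = encode(X)                                  # RAE latents (high-dim)
W_dec, *_ = np.linalg.lstsq(Z, X, rcond=None)  # shape (768, 64)

def decode(z):
    return z @ W_dec

recon_err = np.mean((decode(Z) - X) ** 2)
print(f"reconstruction MSE: {recon_err:.3e}")
```

Because the frozen encoder here is injective on the toy data, a trained linear decoder recovers the inputs almost exactly; the same division of labor (fixed semantic encoder, learned reconstruction decoder) is what the RAE design relies on at scale.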

10 retrieved papers
Theoretically motivated solutions for high-dimensional diffusion

The authors identify and address three key challenges in enabling diffusion transformers to operate effectively in high-dimensional RAE latent spaces: transformer width must match token dimensionality, noise scheduling must be dimension-dependent, and decoders require noise-augmented training. These solutions are supported by theoretical analysis and empirical validation.
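The second point, a dimension-dependent noise schedule, can be illustrated with the SD3-style timestep shift, here applied to latent dimensionality rather than resolution: larger latents need more noise at a given nominal timestep. Whether this exact formula matches the paper is an assumption; the shift factor and `base_dim` below are illustrative.

```python
import math

def shift_timesteps(t: float, dim: int, base_dim: int = 4096) -> float:
    """Shift a flow-matching timestep t in [0, 1] toward higher noise as
    the latent dimension grows. alpha = sqrt(dim / base_dim) is the
    SD3-style shift factor; base_dim is an assumed reference dimension.
    """
    alpha = math.sqrt(dim / base_dim)
    return alpha * t / (1 + (alpha - 1) * t)

# A mid-trajectory timestep is pushed toward 1 (more noise) when the
# latent space is larger than the reference; at dim == base_dim the
# schedule is unchanged.
for d in (4096, 16384, 65536):
    print(d, round(shift_timesteps(0.5, d), 3))
```

The intuition: in high dimensions a given noise level destroys proportionally less information per coordinate, so the schedule must be rescaled for the corruption process to remain comparably difficult.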

10 retrieved papers
DiTDH architecture with wide DDT head

The authors introduce DiTDH, an augmented DiT architecture that incorporates a wide, shallow transformer module (DDT head) dedicated to denoising. This design provides sufficient model width for high-dimensional diffusion without scaling the entire backbone, achieving faster convergence and state-of-the-art generation performance.
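The motivation for a wide, shallow head can be seen in a rough parameter count: widening only a two-block head is far cheaper than widening the entire backbone to the same dimension. The widths, depths, and the 12·width² per-block estimate below are back-of-the-envelope assumptions for illustration, not the paper's actual DiTDH configuration.

```python
def transformer_params(width: int, depth: int) -> int:
    """Rough parameter count for `depth` pre-LN transformer blocks:
    attention (~4 * width^2 for QKV + output projections) plus an MLP
    with 4x expansion (~8 * width^2), i.e. ~12 * width^2 per block.
    """
    return depth * 12 * width ** 2

backbone = transformer_params(width=1152, depth=28)  # DiT-XL-like trunk
head     = transformer_params(width=2048, depth=2)   # wide, shallow head
widened  = transformer_params(width=2048, depth=28)  # widening everything

print(f"backbone + wide head : {(backbone + head) / 1e6:.0f}M params")
print(f"fully widened model  : {widened / 1e6:.0f}M params")
```

Under these assumptions, attaching the wide head adds a modest fraction of the backbone's parameters, whereas widening every block roughly triples the model, which is the trade-off the DiTDH design exploits to get sufficient width for high-dimensional denoising cheaply.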

1 retrieved paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Representation Autoencoders (RAEs)

Contribution

Theoretically motivated solutions for high-dimensional diffusion

Contribution

DiTDH architecture with wide DDT head