Diffusion Transformers with Representation Autoencoders

ICLR 2026 Conference Submission
Anonymous Authors
Generative Models · Diffusion Models · Representation Learning · High-dimensional Diffusion
Abstract:

Latent generative modeling has become the standard strategy for Diffusion Transformers (DiTs), but the autoencoder has barely evolved. Most DiTs still rely on the legacy VAE encoder, which introduces several limitations: a large convolutional backbone that compromises architectural simplicity, a low-dimensional latent space that restricts information capacity, and weak representations resulting from purely reconstruction-based training. In this work, we investigate replacing the VAE encoder–decoder with pretrained representation encoders (e.g., DINO, SigLIP, MAE) combined with trained decoders, forming what we call Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. A key challenge is enabling diffusion transformers to operate effectively in these high-dimensional representation spaces. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation-alignment losses. Using a DiT variant with a lightweight, wide DDT head, we demonstrate state-of-the-art image generation, reaching FIDs of 1.18 at 256×256 and 1.13 at 512×512 resolution on ImageNet.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Representation Autoencoders (RAEs), which replace traditional VAE encoders with pretrained representation encoders (DINO, SigLIP, MAE) paired with trained decoders. This work resides in the 'Representation Autoencoders' leaf of the taxonomy, which contains only two papers including the original. This sparse population suggests the specific approach of combining pretrained encoders with diffusion transformers is relatively unexplored. The taxonomy shows the broader 'Autoencoder Architecture and Latent Space Design' branch contains five leaves addressing compression, latent properties, masking, unified architectures, and representation autoencoders, indicating moderate activity in autoencoder design overall.

The taxonomy reveals neighboring research directions that contextualize this work. The 'High-Compression and Efficient Autoencoders' leaf (three papers) pursues spatial compression through architectural innovations, while 'Latent Space Properties and Optimization' (four papers) analyzes latent characteristics like smoothness and discriminability. The 'Masked Autoencoder Integration' leaf (one paper) explores masking strategies, and 'Unified End-to-End Architectures' (two papers) merges encoder-decoder-diffusion components. The original paper diverges by emphasizing semantically rich pretrained representations rather than compression ratios or end-to-end unification, carving a distinct niche within the autoencoder design landscape.

Among 21 candidates examined across three contributions, none were found to clearly refute the proposed methods. Contribution A (RAEs) examined 10 candidates with zero refutable matches, suggesting limited direct prior work on this specific encoder-decoder combination. Contribution B (theoretically motivated solutions for high-dimensional diffusion) also examined 10 candidates with no refutations, indicating the theoretical analysis may address gaps in existing literature. Contribution C (DiTDH architecture) examined only one candidate with no overlap. The modest search scope (21 papers) and absence of refutations suggest these contributions occupy relatively novel territory within the examined literature.

Based on the limited search scope of 21 semantically related papers, the work appears to introduce a distinct approach within a sparsely populated research direction. The taxonomy structure confirms that representation autoencoders constitute a small but emerging area, with the original paper and one sibling defining this leaf. However, the analysis does not cover exhaustive literature review or broader architectural surveys, leaving open the possibility of related work outside the top-K semantic matches examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: latent diffusion modeling with representation autoencoders. This field centers on learning compact latent representations that enable efficient diffusion-based generation across diverse data modalities. The taxonomy reveals four main branches that collectively map the landscape.

Autoencoder Architecture and Latent Space Design focuses on the structural choices underlying representation autoencoders—ranging from hierarchical designs like Hierarchical Diffusion Autoencoders[20] to specialized compression schemes such as Deep Compression Autoencoder[13] and adaptive encoding strategies like Adaptive Latent Encoding[29]. Representation Learning and Semantic Encoding emphasizes how autoencoders capture meaningful structure, with works like Diffusion Representation Learner[3] and Diffusion Representation Learning[5] exploring semantic disentanglement and interpretability. Generation Applications Across Domains showcases the breadth of modalities tackled—from language (Latent Diffusion Language[4]) and proteins (Latent Diffusion Protein[6], ProteinAE Diffusion[37]) to DNA sequences (Latent Diffusion DNA[10]) and 3D geometry (Geometric Latent Diffusion[19]). Specialized Applications and Analysis addresses niche use cases and analytical perspectives, including medical imaging (Latent Diffusion Medical[12]) and anomaly detection (Lafite Anomaly Detection[24]).

A particularly active line of work explores the interplay between autoencoder design and diffusion quality, with studies like Improving Diffusability Autoencoders[8] and Diffusion Bridge AutoEncoders[15] investigating how latent space properties affect generative performance. Another contrasting theme is the tension between compression efficiency and semantic fidelity: while Deep Compression Autoencoder[13] and LiteVAE[30] prioritize compact representations, works like Structured Latent Space[18] and Lost Latent Space[33] examine the trade-offs in preserving interpretable structure.
The original paper, Diffusion Transformers Autoencoders[0], sits within the Autoencoder Architecture branch alongside MeanFlow Transformers Autoencoders[14], emphasizing transformer-based architectures for representation learning. Compared to neighboring efforts like Diffusion Masked Autoencoders[2], which integrate masking strategies, Diffusion Transformers Autoencoders[0] appears to focus more directly on leveraging transformer expressiveness to refine latent encodings for diffusion, positioning it at the intersection of architectural innovation and representation quality.

Claimed Contributions

Representation Autoencoders (RAEs)

The authors propose Representation Autoencoders (RAEs), which replace traditional VAE encoders with frozen pretrained representation encoders (such as DINOv2, SigLIP, or MAE) paired with lightweight learned decoders. RAEs provide high-quality reconstructions and semantically rich latent spaces while using a scalable transformer-based architecture.
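To make the frozen-encoder/trained-decoder split concrete, the structure can be sketched with toy linear stand-ins: the "pretrained encoder" below is a fixed random projection playing the role of a frozen DINOv2/SigLIP/MAE backbone, and only the decoder is fit. All dimensions and names here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen pretrained encoder (e.g. DINOv2): a fixed
# linear map from a 64-dim "pixel" space to a 768-dim latent space.
# In an RAE the encoder is a real pretrained ViT and is never updated.
W_enc = rng.standard_normal((768, 64)) / np.sqrt(64)

def encode(x):
    # Frozen: no parameters of W_enc are ever trained.
    return x @ W_enc.T

# Only the decoder is learned. With a linear decoder and squared error,
# the optimum is simply the least-squares solution of Z @ W_dec ≈ X.
X = rng.standard_normal((1000, 64))            # toy "images"
Z = encode(X)                                  # RAE latents (high-dim)
W_dec, *_ = np.linalg.lstsq(Z, X, rcond=None)  # shape (768, 64)

def decode(z):
    return z @ W_dec

recon_err = np.mean((decode(Z) - X) ** 2)
print(f"reconstruction MSE: {recon_err:.3e}")
```

Because the frozen encoder here is injective on the toy data, a trained linear decoder recovers the inputs almost exactly; the same division of labor (fixed semantic encoder, learned reconstruction decoder) is what the RAE design relies on at scale.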

10 retrieved papers
Theoretically motivated solutions for high-dimensional diffusion

The authors identify and address three key challenges in enabling diffusion transformers to operate effectively in high-dimensional RAE latent spaces: transformer width must match token dimensionality, noise scheduling must be dimension-dependent, and decoders require noise-augmented training. These solutions are supported by theoretical analysis and empirical validation.
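The second point, a dimension-dependent noise schedule, can be illustrated with the SD3-style timestep shift, here applied to latent dimensionality rather than resolution: larger latents need more noise at a given nominal timestep. Whether this exact formula matches the paper is an assumption; the shift factor and `base_dim` below are illustrative.

```python
import math

def shift_timesteps(t: float, dim: int, base_dim: int = 4096) -> float:
    """Shift a flow-matching timestep t in [0, 1] toward higher noise as
    the latent dimension grows. alpha = sqrt(dim / base_dim) is the
    SD3-style shift factor; base_dim is an assumed reference dimension.
    """
    alpha = math.sqrt(dim / base_dim)
    return alpha * t / (1 + (alpha - 1) * t)

# A mid-trajectory timestep is pushed toward 1 (more noise) when the
# latent space is larger than the reference; at dim == base_dim the
# schedule is unchanged.
for d in (4096, 16384, 65536):
    print(d, round(shift_timesteps(0.5, d), 3))
```

The intuition: in high dimensions a given noise level destroys proportionally less information per coordinate, so the schedule must be rescaled for the corruption process to remain comparably difficult.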

10 retrieved papers
DiTDH architecture with wide DDT head

The authors introduce DiTDH, an augmented DiT architecture that incorporates a wide, shallow transformer module (DDT head) dedicated to denoising. This design provides sufficient model width for high-dimensional diffusion without scaling the entire backbone, achieving faster convergence and state-of-the-art generation performance.
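The motivation for a wide, shallow head can be seen in a rough parameter count: widening only a two-block head is far cheaper than widening the entire backbone to the same dimension. The widths, depths, and the 12·width² per-block estimate below are back-of-the-envelope assumptions for illustration, not the paper's actual DiTDH configuration.

```python
def transformer_params(width: int, depth: int) -> int:
    """Rough parameter count for `depth` pre-LN transformer blocks:
    attention (~4 * width^2 for QKV + output projections) plus an MLP
    with 4x expansion (~8 * width^2), i.e. ~12 * width^2 per block.
    """
    return depth * 12 * width ** 2

backbone = transformer_params(width=1152, depth=28)  # DiT-XL-like trunk
head     = transformer_params(width=2048, depth=2)   # wide, shallow head
widened  = transformer_params(width=2048, depth=28)  # widening everything

print(f"backbone + wide head : {(backbone + head) / 1e6:.0f}M params")
print(f"fully widened model  : {widened / 1e6:.0f}M params")
```

Under these assumptions, attaching the wide head adds a modest fraction of the backbone's parameters, whereas widening every block roughly triples the model, which is the trade-off the DiTDH design exploits to get sufficient width for high-dimensional denoising cheaply.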

1 retrieved paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Representation Autoencoders (RAEs)

Contribution

Theoretically motivated solutions for high-dimensional diffusion

Contribution

DiTDH architecture with wide DDT head