Abstract:

Representation alignment helps generation by distilling representations from a pretrained vision encoder into intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global information (measured by ImageNet-1K accuracy) or its spatial structure (pairwise cosine similarity between patch tokens)? Prevailing wisdom holds that stronger global performance makes for a better target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and multiple model scales. The results are surprising: spatial structure, rather than global performance, drives the generation performance of a target representation. To probe this further, we introduce two straightforward modifications that specifically accentuate the transfer of spatial information: we replace the standard MLP projection layer in REPA with a simple convolution layer, and we introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in fewer than 4 lines of code), termed iREPA, consistently improves the convergence speed of REPA across a diverse set of vision encoders, model sizes, and training variants (such as REPA-E and MeanFlow with REPA). Our work motivates revisiting the fundamental working mechanism of representation alignment and how it can be leveraged for improved training of generative models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates whether spatial structure or global performance of pretrained vision encoders drives representation alignment effectiveness in diffusion models. It sits within the Feature Space Alignment Foundations leaf, which contains three papers establishing foundational alignment techniques for image generation. This leaf is part of the broader Image Generation Alignment subtopic under Core Representation Alignment Methods. The taxonomy shows this is a moderately populated research direction, with sibling papers exploring complementary aspects of feature space alignment but not directly addressing the spatial-versus-global question posed here.

The taxonomy reveals that neighboring leaves emphasize different alignment dimensions: Spatial Structure Emphasis focuses on preserving local feature correspondence, while Multimodal Representation Fusion integrates multiple modalities. The original paper bridges these directions by empirically demonstrating that spatial structure—not global accuracy—predicts alignment success. The taxonomy's scope_note for Feature Space Alignment Foundations explicitly excludes spatial structure emphasis, suggesting the paper's findings challenge existing categorical boundaries. Related branches like Text-Visual Alignment Enhancement and Inference-Time Alignment address orthogonal concerns (prompt fidelity, post-training guidance) rather than the fundamental encoder property question examined here.

Among 30 candidates examined across three contributions, none clearly refute the paper's claims. The large-scale empirical analysis (10 candidates examined, 0 refutable) appears novel in systematically comparing 27 encoders on spatial versus global metrics. The Spatial Structure Metric contribution (10 candidates, 0 refutable) introduces a predictive measure not found in examined prior work. The iREPA training recipe (10 candidates, 0 refutable) proposes simple architectural modifications—convolution layers and spatial normalization—that accentuate spatial transfer. The limited search scope means these findings reflect novelty within top-30 semantic matches, not exhaustive field coverage.

Based on the limited literature search, the paper appears to address an underexplored question within representation alignment: which encoder properties matter most. The taxonomy structure shows the field has organized around alignment mechanisms and modalities but less around encoder selection principles. The empirical scale (27 encoders) and the simplicity of the proposed modifications (under 4 lines of code) suggest practical contributions, though the analysis cannot confirm whether similar spatial-versus-global comparisons exist in work outside the examined candidates.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: representation alignment for diffusion model training. The field has organized itself around several major branches that reflect different stages and modalities of alignment. Core Representation Alignment Methods establish foundational techniques for matching feature spaces during training, often focusing on image generation and cross-modal consistency. Text-Visual Alignment Enhancement addresses the challenge of faithfully translating textual descriptions into visual outputs, with works like Text-image Alignment[3] and Long-text Alignment[9] tackling prompt fidelity at different scales. Inference-Time Alignment and Post-Training Optimization branches explore how to refine alignment after initial training, using techniques such as Direct Preference Optimization[6] and reward-guided generation. Domain Adaptation and Generalization methods extend alignment strategies across different data distributions, while Specialized Alignment Applications target specific modalities like video, audio, or 3D content. Theoretical and Survey Perspectives provide overarching frameworks, as seen in Preference Alignment Survey[28] and related tutorial works.

Within the dense Core Representation Alignment Methods branch, a key tension emerges between global feature matching and spatially-aware alignment strategies. Global or Spatial[0] sits at the intersection of these approaches within the Feature Space Alignment Foundations cluster, exploring how different granularities of alignment affect generation quality. This contrasts with nearby works like Representation Alignment Generation[1], which emphasizes end-to-end learned alignment mechanisms, and Representation Learner[27], which focuses on discovering alignment structures from data. Cross-frame Alignment[2] extends similar principles to temporal consistency in video generation.

The original paper's emphasis on spatial versus global trade-offs positions it as addressing a fundamental design choice that ripples through many downstream applications, from text-to-image synthesis to domain transfer tasks.

Claimed Contributions

Large-scale empirical analysis showing spatial structure drives representation alignment effectiveness

The authors conduct extensive experiments across 27 vision encoders and multiple model scales, demonstrating that spatial self-similarity structure (measured by metrics like LDS) correlates much more strongly with generation performance than global semantic information (measured by ImageNet-1K accuracy). This challenges the prevailing assumption that better global performance leads to better generation.

10 retrieved papers
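The core of the empirical analysis described above is a correlation comparison across encoders. The sketch below illustrates the methodology only; the per-encoder scores are hypothetical placeholders, not the paper's data, and the spatial-structure score is an abstract stand-in for metrics like LDS.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical per-encoder measurements (NOT the paper's numbers):
# a spatial-structure score, a linear-probe accuracy, and the FID
# reached when each encoder is used as the alignment target.
spatial_score  = [0.9, 0.7, 0.8, 0.4, 0.6, 0.3]
probe_accuracy = [0.80, 0.84, 0.78, 0.83, 0.81, 0.79]
fid            = [8.0, 12.0, 10.0, 19.0, 14.0, 21.0]

# Lower FID is better, so a useful predictor should correlate
# strongly in magnitude (and negatively) with FID.
r_spatial = pearson(spatial_score, fid)
r_probe   = pearson(probe_accuracy, fid)
print(f"spatial vs FID: r = {r_spatial:+.2f}")
print(f"probe   vs FID: r = {r_probe:+.2f}")
```

In this toy setup the spatial score tracks FID almost perfectly while probe accuracy is nearly uninformative, mirroring the qualitative pattern the paper reports (|r| > 0.85 for spatial metrics versus 0.26 for linear probing).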
Spatial Structure Metric (SSM) for predicting representation alignment performance

The authors propose several metrics to quantify spatial self-similarity structure between patch tokens, including LDS (local-distant similarity), which measures how cosine similarity varies with spatial distance. These metrics achieve Pearson correlation above 0.85 with generation FID, far exceeding the 0.26 correlation of linear probing accuracy.

10 retrieved papers
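The paper's exact LDS formula is not reproduced in this report, so the sketch below assumes a simple form consistent with the description: mean cosine similarity among spatially nearby patch pairs minus that among distant pairs. The distance thresholds (`local_max`, `distant_min`) are illustrative assumptions.

```python
import numpy as np

def lds(tokens, grid, local_max=1.0, distant_min=4.0):
    """Local-distant similarity (LDS) sketch.

    tokens: [N, D] patch embeddings laid out on a grid x grid map.
    Returns the mean cosine similarity of nearby patch pairs minus
    that of far-apart pairs; spatially structured features score high.
    """
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = t @ t.T  # [N, N] pairwise cosine similarity
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    dist = np.hypot(ys[:, None] - ys[None, :], xs[:, None] - xs[None, :])
    local = sim[(dist > 0) & (dist <= local_max)]
    distant = sim[dist >= distant_min]
    return float(local.mean() - distant.mean())
```

A feature map that varies smoothly over the grid (nearby patches alike, distant patches different) yields a clearly positive LDS, while spatially shuffling the same tokens drives the score toward zero.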
iREPA: improved training recipe accentuating spatial feature transfer

The authors introduce iREPA, which replaces the standard MLP projection layer with a convolutional layer and adds a spatial normalization layer to enhance spatial feature transfer. This simple modification (implemented in fewer than 4 lines of code) consistently improves convergence speed across diverse encoders, model sizes, and training variants including REPA-E and MeanFlow with REPA.

10 retrieved papers
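The two iREPA modifications described above can be sketched in plain NumPy. This is a minimal illustration, not the paper's code: the kernel size and the exact form of the spatial normalization (here, per-token centering and unit-normalization of the target features) are assumptions; a real implementation would use a PyTorch `nn.Conv2d` on batched tensors.

```python
import numpy as np

def conv_project(feats, kernel):
    """3x3 convolution over the token map, replacing REPA's MLP projection.

    feats: [H, W, Cin] diffusion features on the patch grid.
    kernel: [3, 3, Cin, Cout]. Zero padding keeps the grid size, and the
    3x3 support lets neighboring tokens shape each projected token.
    """
    H, W, _ = feats.shape
    Cout = kernel.shape[-1]
    padded = np.pad(feats, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, Cout))
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + H, dx:dx + W] @ kernel[dy, dx]
    return out

def spatial_norm(feats, eps=1e-6):
    """Normalize each patch token of the target representation.

    Removing per-token mean and scale keeps the pairwise
    cosine-similarity structure while discarding magnitude.
    """
    centered = feats - feats.mean(axis=-1, keepdims=True)
    return centered / (np.linalg.norm(centered, axis=-1, keepdims=True) + eps)
```

In a training loop this amounts to swapping one projection module and inserting one normalization call before the alignment loss, which is consistent with the claim that the change fits in fewer than 4 lines of code.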

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Large-scale empirical analysis showing spatial structure drives representation alignment effectiveness

The authors conduct extensive experiments across 27 vision encoders and multiple model scales, demonstrating that spatial self-similarity structure (measured by metrics like LDS) correlates much more strongly with generation performance than global semantic information (measured by ImageNet-1K accuracy). This challenges the prevailing assumption that better global performance leads to better generation.

Contribution

Spatial Structure Metric (SSM) for predicting representation alignment performance

The authors propose several metrics to quantify spatial self-similarity structure between patch tokens, including LDS (local-distant similarity), which measures how cosine similarity varies with spatial distance. These metrics achieve Pearson correlation above 0.85 with generation FID, far exceeding the 0.26 correlation of linear probing accuracy.

Contribution

iREPA: improved training recipe accentuating spatial feature transfer

The authors introduce iREPA, which replaces the standard MLP projection layer with a convolutional layer and adds a spatial normalization layer to enhance spatial feature transfer. This simple modification (implemented in fewer than 4 lines of code) consistently improves convergence speed across diverse encoders, model sizes, and training variants including REPA-E and Meanflow with REPA.