Abstract:

Representation alignment helps generation by distilling representations from a pretrained vision encoder into intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global information (measured by ImageNet-1K accuracy) or its spatial structure (pairwise cosine similarity between patch tokens)? Prevailing wisdom holds that stronger global performance makes for a better target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and multiple model scales. The results are surprising: spatial structure, rather than global performance, drives the generation performance of a target representation. To probe this further, we introduce two straightforward modifications that specifically accentuate the transfer of spatial information: we replace the standard MLP projection layer in REPA with a simple convolution layer, and we introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in fewer than 4 lines of code), termed iREPA, consistently improves the convergence speed of REPA across a diverse set of vision encoders, model sizes, and training variants (such as REPA-E and MeanFlow with REPA). Our work motivates revisiting the fundamental working mechanism of representation alignment and how it can be leveraged for improved training of generative models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates whether spatial structure or global performance of pretrained vision encoders drives representation alignment effectiveness in diffusion models. It sits within the Feature Space Alignment Foundations leaf, which contains three papers establishing foundational alignment techniques for image generation. This leaf is part of the broader Image Generation Alignment subtopic under Core Representation Alignment Methods. The taxonomy shows this is a moderately populated research direction, with sibling papers exploring complementary aspects of feature space alignment but not directly addressing the spatial-versus-global question posed here.

The taxonomy reveals that neighboring leaves emphasize different alignment dimensions: Spatial Structure Emphasis focuses on preserving local feature correspondence, while Multimodal Representation Fusion integrates multiple modalities. The original paper bridges these directions by empirically demonstrating that spatial structure—not global accuracy—predicts alignment success. The taxonomy's scope_note for Feature Space Alignment Foundations explicitly excludes spatial structure emphasis, suggesting the paper's findings challenge existing categorical boundaries. Related branches like Text-Visual Alignment Enhancement and Inference-Time Alignment address orthogonal concerns (prompt fidelity, post-training guidance) rather than the fundamental encoder property question examined here.

Among 30 candidates examined across three contributions, none clearly refute the paper's claims. The large-scale empirical analysis (10 candidates examined, 0 refutable) appears novel in systematically comparing 27 encoders on spatial versus global metrics. The Spatial Structure Metric contribution (10 candidates, 0 refutable) introduces a predictive measure not found in examined prior work. The iREPA training recipe (10 candidates, 0 refutable) proposes simple architectural modifications—convolution layers and spatial normalization—that accentuate spatial transfer. The limited search scope means these findings reflect novelty within top-30 semantic matches, not exhaustive field coverage.

Based on the limited literature search, the paper appears to address an underexplored question within representation alignment: which encoder properties matter most. The taxonomy structure shows the field has organized around alignment mechanisms and modalities but less around encoder selection principles. The empirical scale (27 encoders) and the simplicity of the proposed modifications (under 4 lines of code) suggest practical contributions, though the analysis cannot confirm whether similar spatial-versus-global comparisons exist in work outside the examined candidates.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: representation alignment for diffusion model training. The field has organized itself around several major branches that reflect different stages and modalities of alignment. Core Representation Alignment Methods establish foundational techniques for matching feature spaces during training, often focusing on image generation and cross-modal consistency. Text-Visual Alignment Enhancement addresses the challenge of faithfully translating textual descriptions into visual outputs, with works like Text-image Alignment[3] and Long-text Alignment[9] tackling prompt fidelity at different scales. Inference-Time Alignment and Post-Training Optimization branches explore how to refine alignment after initial training, using techniques such as Direct Preference Optimization[6] and reward-guided generation. Domain Adaptation and Generalization methods extend alignment strategies across different data distributions, while Specialized Alignment Applications target specific modalities like video, audio, or 3D content. Theoretical and Survey Perspectives provide overarching frameworks, as seen in Preference Alignment Survey[28] and related tutorial works.

Within the dense Core Representation Alignment Methods branch, a key tension emerges between global feature matching and spatially-aware alignment strategies. Global or Spatial[0] sits at the intersection of these approaches within the Feature Space Alignment Foundations cluster, exploring how different granularities of alignment affect generation quality. This contrasts with nearby works like Representation Alignment Generation[1], which emphasizes end-to-end learned alignment mechanisms, and Representation Learner[27], which focuses on discovering alignment structures from data. Cross-frame Alignment[2] extends similar principles to temporal consistency in video generation.

The original paper's emphasis on spatial versus global trade-offs positions it as addressing a fundamental design choice that ripples through many downstream applications, from text-to-image synthesis to domain transfer tasks.

Claimed Contributions

Large-scale empirical analysis showing spatial structure drives representation alignment effectiveness

The authors conduct extensive experiments across 27 vision encoders and multiple model scales, demonstrating that spatial self-similarity structure (measured by metrics like LDS) correlates much more strongly with generation performance than global semantic information (measured by ImageNet-1K accuracy). This challenges the prevailing assumption that better global performance leads to better generation.

10 retrieved papers
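The core of the empirical analysis described above is a correlation comparison across encoders. The sketch below illustrates the methodology only; the per-encoder scores are hypothetical placeholders, not the paper's data, and the spatial-structure score is an abstract stand-in for metrics like LDS.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical per-encoder measurements (NOT the paper's numbers):
# a spatial-structure score, a linear-probe accuracy, and the FID
# reached when each encoder is used as the alignment target.
spatial_score  = [0.9, 0.7, 0.8, 0.4, 0.6, 0.3]
probe_accuracy = [0.80, 0.84, 0.78, 0.83, 0.81, 0.79]
fid            = [8.0, 12.0, 10.0, 19.0, 14.0, 21.0]

# Lower FID is better, so a useful predictor should correlate
# strongly in magnitude (and negatively) with FID.
r_spatial = pearson(spatial_score, fid)
r_probe   = pearson(probe_accuracy, fid)
print(f"spatial vs FID: r = {r_spatial:+.2f}")
print(f"probe   vs FID: r = {r_probe:+.2f}")
```

In this toy setup the spatial score tracks FID almost perfectly while probe accuracy is nearly uninformative, mirroring the qualitative pattern the paper reports (|r| > 0.85 for spatial metrics versus 0.26 for linear probing).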
Spatial Structure Metric (SSM) for predicting representation alignment performance

The authors propose several metrics to quantify spatial self-similarity structure between patch tokens, including LDS (local-distant similarity), which measures how cosine similarity varies with spatial distance. These metrics achieve Pearson correlation above 0.85 with generation FID, far exceeding the 0.26 correlation of linear probing accuracy.

10 retrieved papers
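The paper's exact LDS formula is not reproduced in this report, so the sketch below assumes a simple form consistent with the description: mean cosine similarity among spatially nearby patch pairs minus that among distant pairs. The distance thresholds (`local_max`, `distant_min`) are illustrative assumptions.

```python
import numpy as np

def lds(tokens, grid, local_max=1.0, distant_min=4.0):
    """Local-distant similarity (LDS) sketch.

    tokens: [N, D] patch embeddings laid out on a grid x grid map.
    Returns the mean cosine similarity of nearby patch pairs minus
    that of far-apart pairs; spatially structured features score high.
    """
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = t @ t.T  # [N, N] pairwise cosine similarity
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    dist = np.hypot(ys[:, None] - ys[None, :], xs[:, None] - xs[None, :])
    local = sim[(dist > 0) & (dist <= local_max)]
    distant = sim[dist >= distant_min]
    return float(local.mean() - distant.mean())
```

A feature map that varies smoothly over the grid (nearby patches alike, distant patches different) yields a clearly positive LDS, while spatially shuffling the same tokens drives the score toward zero.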
iREPA: improved training recipe accentuating spatial feature transfer

The authors introduce iREPA, which replaces the standard MLP projection layer with a convolutional layer and adds a spatial normalization layer to enhance spatial feature transfer. This simple modification (implemented in fewer than 4 lines of code) consistently improves convergence speed across diverse encoders, model sizes, and training variants including REPA-E and MeanFlow with REPA.

10 retrieved papers
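The two iREPA modifications described above can be sketched in plain NumPy. This is a minimal illustration, not the paper's code: the kernel size and the exact form of the spatial normalization (here, per-token centering and unit-normalization of the target features) are assumptions; a real implementation would use a PyTorch `nn.Conv2d` on batched tensors.

```python
import numpy as np

def conv_project(feats, kernel):
    """3x3 convolution over the token map, replacing REPA's MLP projection.

    feats: [H, W, Cin] diffusion features on the patch grid.
    kernel: [3, 3, Cin, Cout]. Zero padding keeps the grid size, and the
    3x3 support lets neighboring tokens shape each projected token.
    """
    H, W, _ = feats.shape
    Cout = kernel.shape[-1]
    padded = np.pad(feats, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, Cout))
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + H, dx:dx + W] @ kernel[dy, dx]
    return out

def spatial_norm(feats, eps=1e-6):
    """Normalize each patch token of the target representation.

    Removing per-token mean and scale keeps the pairwise
    cosine-similarity structure while discarding magnitude.
    """
    centered = feats - feats.mean(axis=-1, keepdims=True)
    return centered / (np.linalg.norm(centered, axis=-1, keepdims=True) + eps)
```

In a training loop this amounts to swapping one projection module and inserting one normalization call before the alignment loss, which is consistent with the claim that the change fits in fewer than 4 lines of code.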

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Large-scale empirical analysis showing spatial structure drives representation alignment effectiveness

The authors conduct extensive experiments across 27 vision encoders and multiple model scales, demonstrating that spatial self-similarity structure (measured by metrics like LDS) correlates much more strongly with generation performance than global semantic information (measured by ImageNet-1K accuracy). This challenges the prevailing assumption that better global performance leads to better generation.

Contribution

Spatial Structure Metric (SSM) for predicting representation alignment performance

The authors propose several metrics to quantify spatial self-similarity structure between patch tokens, including LDS (local-distant similarity), which measures how cosine similarity varies with spatial distance. These metrics achieve Pearson correlation above 0.85 with generation FID, far exceeding the 0.26 correlation of linear probing accuracy.

Contribution

iREPA: improved training recipe accentuating spatial feature transfer

The authors introduce iREPA, which replaces the standard MLP projection layer with a convolutional layer and adds a spatial normalization layer to enhance spatial feature transfer. This simple modification (implemented in fewer than 4 lines of code) consistently improves convergence speed across diverse encoders, model sizes, and training variants including REPA-E and Meanflow with REPA.