Reconciling Visual Perception and Generation in Diffusion Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Visual Perception · Image Classification · Object Detection · Semantic Segmentation
Abstract:

We present GenRep, a unified image understanding and synthesis model that jointly performs discriminative learning and generative modeling within a single training process. By leveraging a Monte Carlo approximation, GenRep distills the distributional knowledge embedded in diffusion models to guide discriminative learning for visual perception tasks. Simultaneously, a semantic-driven image generation process is established, in which high-level semantics learned from perception tasks inform image synthesis, creating a positive feedback loop of mutual improvement. Moreover, to reconcile the learning processes of the two tasks, a gradient alignment strategy is proposed that symmetrically modifies the optimization directions of the perception and generation losses. These designs make GenRep a versatile and powerful model that achieves leading performance on both image understanding and generation benchmarks. Code will be released after acceptance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes GenRep, a unified model that jointly performs discriminative learning and generative modeling within a single diffusion framework. It resides in the Alignment and Distillation Methods leaf under Training Strategies and Optimization, alongside two sibling papers (Reconstruction Alignment and X2i). This leaf is relatively sparse within the broader taxonomy of 50 papers across 36 topics, suggesting that techniques explicitly reconciling perception and generation objectives through alignment strategies remain an active but not yet saturated research direction.

The taxonomy reveals neighboring work in Unified Multimodal Architectures (e.g., Shared Embedding Space Approaches with four papers, Dual-Branch Architectures with three papers) that pursue joint training through architectural design rather than alignment-focused optimization. Additionally, Self-Supervised Pretraining (two papers) explores initialization strategies without explicit alignment, while Latent Space Stabilization (one paper) addresses regularization concerns. GenRep's focus on gradient-level alignment and distillation distinguishes it from these architectural and pretraining-centric approaches, positioning it at the intersection of optimization strategy and multimodal unification.

Among 30 candidates examined, the contribution-level analysis reveals limited prior work overlap. The unified model contribution examined 10 candidates with 1 refutable match, the Monte Carlo distillation contribution examined 10 candidates with 1 refutable match, and the gradient alignment strategy examined 10 candidates with 2 refutable matches. These statistics suggest that while some related techniques exist in the limited search scope, the specific combination of distillation, semantic-driven generation, and gradient alignment appears less extensively covered. The gradient alignment component shows slightly more prior work presence, indicating this may be a more explored sub-area.

Based on the top-30 semantic matches examined, the work appears to occupy a moderately novel position within alignment-based training strategies for unified models. The analysis does not cover the full breadth of diffusion model literature, and the taxonomy structure suggests this research direction is still developing. The combination of contributions may offer incremental advances over existing alignment methods, though the limited search scope prevents definitive claims about the degree of novelty relative to the entire field.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: unified image understanding and synthesis in diffusion models. The field has evolved into a rich ecosystem organized around several complementary directions. Unified Multimodal Architectures explore how to build single frameworks that handle both perception and generation tasks, often leveraging shared representations across modalities. Training Strategies and Optimization investigates techniques such as alignment and distillation methods to improve model efficiency and quality, while Conditional Generation and Control focuses on guiding synthesis through various input signals. Representation Learning and Feature Extraction examines how diffusion models can serve as feature extractors for downstream tasks, and Data Synthesis and Augmentation leverages generative capabilities to create training data. Domain-Specific Applications adapt these methods to specialized areas like medical imaging or 3D synthesis, and Challenges and Analysis addresses robustness, copying issues, and evaluation metrics. Survey and Comparative Studies provide overarching perspectives on the rapidly growing literature.

Within this landscape, a particularly active line of work centers on bridging the gap between understanding and generation. Reconciling Perception Generation[0] sits squarely in the Training Strategies and Optimization branch, specifically within Alignment and Distillation Methods, where it addresses how to align perceptual and generative objectives within a unified diffusion framework. This contrasts with neighboring efforts like X2i[16] and Reconstruction Alignment[34], which also explore alignment strategies but emphasize different trade-offs between reconstruction fidelity and generative flexibility. Meanwhile, works in Unified Multimodal Architectures such as One Diffusion[41] and BLIP3-o[14] pursue end-to-end joint training across tasks, raising questions about whether alignment should happen at the architectural level or through post-hoc distillation. The interplay between these approaches highlights ongoing debates about modularity versus integration, and whether perceptual and generative capacities are best reconciled through shared training objectives or through carefully designed architectural inductive biases.

Claimed Contributions

GENREP unified model for joint discriminative and generative learning

The authors introduce GENREP, a unified framework that simultaneously performs visual perception tasks and image generation within a single model and training process, bridging the gap between traditionally separate discriminative and generative paradigms.

10 retrieved papers · Can Refute

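As a rough illustration of what joint discriminative and generative training in one process entails, the sketch below combines a toy perception loss (cross-entropy) and a toy generation loss (denoising MSE) on a shared backbone and takes one gradient step on their sum. The identity heads, the finite-difference gradient, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def joint_loss(W, x, y_onehot, lam=1.0):
    """Combined objective L = L_perception + lam * L_generation on a
    shared linear backbone W (toy identity heads for both tasks)."""
    h = x @ W                                    # shared features
    logits = h                                   # perception head (identity, toy)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss_perc = -float(np.sum(y_onehot * np.log(p + 1e-12)))
    recon = h                                    # generation head (identity, toy)
    loss_gen = float(np.mean((recon - x) ** 2))  # denoise back to the clean input
    return loss_perc + lam * loss_gen

def finite_diff_grad(f, W, eps=1e-5):
    """Numerical gradient, standing in for backpropagation."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
y = np.eye(4)[1]                                 # one-hot perception target
W = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
f = lambda W_: joint_loss(W_, x, y)
W_new = W - 0.01 * finite_diff_grad(f, W)        # one joint update step
```

A single update on the shared parameters reduces the combined objective, which is the basic premise behind training both capabilities in one process rather than in separate models.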
Distributional knowledge distillation via Monte Carlo approximation

The authors propose a method that uses Markov Chain Monte Carlo approximation over intermediate reverse diffusion outputs to estimate conditional distributions p(x|y), then applies Bayes' theorem to obtain posterior probabilities that guide discriminative learning through a KL divergence loss.

10 retrieved papers · Can Refute

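A minimal sketch of the idea as described above, under stated assumptions: treat the conditional diffusion model's denoising error, Monte Carlo averaged over sampled reverse-process steps, as a proxy for -log p(x|y); apply Bayes' rule under a uniform class prior to obtain p(y|x); and distill that posterior into a student classifier with a KL divergence loss. `diffusion_loss`, the toy setup, and all names are hypothetical, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def teacher_posterior(diffusion_loss, x, num_classes, n_samples=8, rng=None):
    """Estimate p(y|x) from a conditional diffusion model via Bayes' rule.

    `diffusion_loss(x, y, rng)` is an assumed callable returning one
    stochastic denoising-error sample, used as a proxy for -log p(x|y).
    """
    rng = rng or np.random.default_rng(0)
    log_lik = np.empty(num_classes)
    for y in range(num_classes):
        # Monte Carlo average over random timestep/noise draws.
        losses = [diffusion_loss(x, y, rng) for _ in range(n_samples)]
        log_lik[y] = -np.mean(losses)            # proxy for log p(x|y)
    return softmax(log_lik)                      # p(y|x) under a uniform p(y)

def distill_kl(student_logits, posterior, eps=1e-12):
    """KL(teacher posterior || student) guiding discriminative learning."""
    q = softmax(student_logits)
    return float(np.sum(posterior * (np.log(posterior + eps) - np.log(q + eps))))

# Toy demo: a fake per-class denoising loss where class 0 fits x best.
toy_loss = lambda x, y, rng: (0.1 if y == 0 else 1.0) + 0.01 * abs(rng.standard_normal())
p = teacher_posterior(toy_loss, x=None, num_classes=3, n_samples=16)
```

The posterior concentrates on the class whose conditional denoising error is lowest, and the KL term vanishes once the student's predictive distribution matches the teacher's.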
Gradient alignment strategy for reconciling dual objectives

The authors develop a gradient alignment mechanism that decomposes gradients from perception and generation losses into parallel and orthogonal components, then adaptively dampens conflicting directions while preserving non-conflicting information to harmonize joint optimization of both tasks.

10 retrieved papers · Can Refute
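The described mechanism resembles gradient-surgery techniques (e.g., PCGrad-style projection). The sketch below is a hedged guess at its shape, not the authors' code: when the two task gradients conflict (negative inner product), each is decomposed into components parallel and orthogonal to the other, and only the opposing parallel part is scaled down. The function name and the `damp` factor are illustrative assumptions.

```python
import numpy as np

def align_gradients(g_p: np.ndarray, g_g: np.ndarray, damp: float = 0.5):
    """Symmetrically soften the conflicting components of two task gradients."""
    def decompose(g, onto):
        denom = float(onto @ onto) + 1e-12
        parallel = (g @ onto) / denom * onto     # component along `onto`
        return parallel, g - parallel            # (parallel, orthogonal)

    if float(g_p @ g_g) >= 0.0:
        # Non-conflicting gradients are left untouched.
        return g_p, g_g

    par_p, orth_p = decompose(g_p, g_g)
    par_g, orth_g = decompose(g_g, g_p)
    # Preserve the orthogonal (non-conflicting) information; scale down
    # the mutually opposing parallel components by the damping factor.
    return orth_p + damp * par_p, orth_g + damp * par_g
```

Unlike a hard projection that removes the conflicting component entirely, damping with 0 < damp < 1 keeps part of each task's original direction, which matches the description of adaptively dampening, rather than discarding, conflicting directions.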

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Each claimed contribution is analyzed in turn:
1. GENREP unified model for joint discriminative and generative learning
2. Distributional knowledge distillation via Monte Carlo approximation
3. Gradient alignment strategy for reconciling dual objectives
