Reconciling Visual Perception and Generation in Diffusion Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Visual Perception · Image Classification · Object Detection · Semantic Segmentation
Abstract:

We present GenRep, a unified image understanding and synthesis model that jointly performs discriminative learning and generative modeling within a single training process. By leveraging a Monte Carlo approximation, GenRep distills the distributional knowledge embedded in diffusion models to guide discriminative learning for visual perception tasks. Simultaneously, a semantic-driven image generation process is established, in which high-level semantics learned from perception tasks inform image synthesis, creating a positive feedback loop of mutual improvement. Moreover, to reconcile the learning processes of the two tasks, a gradient alignment strategy is proposed that symmetrically modifies the optimization directions of the perception and generation losses. These designs make GenRep a versatile and powerful model that achieves leading performance on both image understanding and generation benchmarks. Code will be released after acceptance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes GenRep, a unified model that jointly performs discriminative learning and generative modeling within a single diffusion framework. It resides in the Alignment and Distillation Methods leaf under Training Strategies and Optimization, alongside two sibling papers (Reconstruction Alignment and X2i). This leaf is relatively sparse within the broader taxonomy of 50 papers across 36 topics, suggesting that techniques explicitly reconciling perception and generation objectives through alignment strategies remain an active but not yet saturated research direction.

The taxonomy reveals neighboring work in Unified Multimodal Architectures (e.g., Shared Embedding Space Approaches with four papers, Dual-Branch Architectures with three papers) that pursue joint training through architectural design rather than alignment-focused optimization. Additionally, Self-Supervised Pretraining (two papers) explores initialization strategies without explicit alignment, while Latent Space Stabilization (one paper) addresses regularization concerns. GenRep's focus on gradient-level alignment and distillation distinguishes it from these architectural and pretraining-centric approaches, positioning it at the intersection of optimization strategy and multimodal unification.

Among 30 candidates examined, the contribution-level analysis reveals limited prior work overlap. The unified model contribution examined 10 candidates with 1 refutable match, the Monte Carlo distillation contribution examined 10 candidates with 1 refutable match, and the gradient alignment strategy examined 10 candidates with 2 refutable matches. These statistics suggest that while some related techniques exist in the limited search scope, the specific combination of distillation, semantic-driven generation, and gradient alignment appears less extensively covered. The gradient alignment component shows slightly more prior work presence, indicating this may be a more explored sub-area.

Based on the top-30 semantic matches examined, the work appears to occupy a moderately novel position within alignment-based training strategies for unified models. The analysis does not cover the full breadth of diffusion model literature, and the taxonomy structure suggests this research direction is still developing. The combination of contributions may offer incremental advances over existing alignment methods, though the limited search scope prevents definitive claims about the degree of novelty relative to the entire field.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: unified image understanding and synthesis in diffusion models. The field has evolved into a rich ecosystem organized around several complementary directions. Unified Multimodal Architectures explore how to build single frameworks that handle both perception and generation tasks, often leveraging shared representations across modalities. Training Strategies and Optimization investigates techniques such as alignment and distillation methods to improve model efficiency and quality, while Conditional Generation and Control focuses on guiding synthesis through various input signals. Representation Learning and Feature Extraction examines how diffusion models can serve as feature extractors for downstream tasks, and Data Synthesis and Augmentation leverages generative capabilities to create training data. Domain-Specific Applications adapt these methods to specialized areas like medical imaging or 3D synthesis, and Challenges and Analysis addresses robustness, copying issues, and evaluation metrics. Survey and Comparative Studies provide overarching perspectives on the rapidly growing literature.

Within this landscape, a particularly active line of work centers on bridging the gap between understanding and generation. Reconciling Perception Generation[0] sits squarely in the Training Strategies and Optimization branch, specifically within Alignment and Distillation Methods, where it addresses how to align perceptual and generative objectives within a unified diffusion framework. This contrasts with neighboring efforts like X2i[16] and Reconstruction Alignment[34], which also explore alignment strategies but emphasize different trade-offs between reconstruction fidelity and generative flexibility. Meanwhile, works in Unified Multimodal Architectures such as One Diffusion[41] and BLIP3-o[14] pursue end-to-end joint training across tasks, raising questions about whether alignment should happen at the architectural level or through post-hoc distillation. The interplay between these approaches highlights ongoing debates about modularity versus integration, and whether perceptual and generative capacities are best reconciled through shared training objectives or through carefully designed architectural inductive biases.

Claimed Contributions

GENREP unified model for joint discriminative and generative learning

The authors introduce GENREP, a unified framework that simultaneously performs visual perception tasks and image generation within a single model and training process, bridging the gap between traditionally separate discriminative and generative paradigms.

10 retrieved papers · Can Refute

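As a rough illustration of what joint discriminative and generative training in one process entails, the sketch below combines a toy perception loss (cross-entropy) and a toy generation loss (denoising MSE) on a shared backbone and takes one gradient step on their sum. The identity heads, the finite-difference gradient, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def joint_loss(W, x, y_onehot, lam=1.0):
    """Combined objective L = L_perception + lam * L_generation on a
    shared linear backbone W (toy identity heads for both tasks)."""
    h = x @ W                                    # shared features
    logits = h                                   # perception head (identity, toy)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss_perc = -float(np.sum(y_onehot * np.log(p + 1e-12)))
    recon = h                                    # generation head (identity, toy)
    loss_gen = float(np.mean((recon - x) ** 2))  # denoise back to the clean input
    return loss_perc + lam * loss_gen

def finite_diff_grad(f, W, eps=1e-5):
    """Numerical gradient, standing in for backpropagation."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
y = np.eye(4)[1]                                 # one-hot perception target
W = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
f = lambda W_: joint_loss(W_, x, y)
W_new = W - 0.01 * finite_diff_grad(f, W)        # one joint update step
```

A single update on the shared parameters reduces the combined objective, which is the basic premise behind training both capabilities in one process rather than in separate models.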
Distributional knowledge distillation via Monte Carlo approximation

The authors propose a method that uses Markov Chain Monte Carlo approximation over intermediate reverse diffusion outputs to estimate conditional distributions p(x|y), then applies Bayes' theorem to obtain posterior probabilities that guide discriminative learning through a KL divergence loss.

10 retrieved papers · Can Refute

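A minimal sketch of the idea as described above, under stated assumptions: treat the conditional diffusion model's denoising error, Monte Carlo averaged over sampled reverse-process steps, as a proxy for -log p(x|y); apply Bayes' rule under a uniform class prior to obtain p(y|x); and distill that posterior into a student classifier with a KL divergence loss. `diffusion_loss`, the toy setup, and all names are hypothetical, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def teacher_posterior(diffusion_loss, x, num_classes, n_samples=8, rng=None):
    """Estimate p(y|x) from a conditional diffusion model via Bayes' rule.

    `diffusion_loss(x, y, rng)` is an assumed callable returning one
    stochastic denoising-error sample, used as a proxy for -log p(x|y).
    """
    rng = rng or np.random.default_rng(0)
    log_lik = np.empty(num_classes)
    for y in range(num_classes):
        # Monte Carlo average over random timestep/noise draws.
        losses = [diffusion_loss(x, y, rng) for _ in range(n_samples)]
        log_lik[y] = -np.mean(losses)            # proxy for log p(x|y)
    return softmax(log_lik)                      # p(y|x) under a uniform p(y)

def distill_kl(student_logits, posterior, eps=1e-12):
    """KL(teacher posterior || student) guiding discriminative learning."""
    q = softmax(student_logits)
    return float(np.sum(posterior * (np.log(posterior + eps) - np.log(q + eps))))

# Toy demo: a fake per-class denoising loss where class 0 fits x best.
toy_loss = lambda x, y, rng: (0.1 if y == 0 else 1.0) + 0.01 * abs(rng.standard_normal())
p = teacher_posterior(toy_loss, x=None, num_classes=3, n_samples=16)
```

The posterior concentrates on the class whose conditional denoising error is lowest, and the KL term vanishes once the student's predictive distribution matches the teacher's.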
Gradient alignment strategy for reconciling dual objectives

The authors develop a gradient alignment mechanism that decomposes gradients from perception and generation losses into parallel and orthogonal components, then adaptively dampens conflicting directions while preserving non-conflicting information to harmonize joint optimization of both tasks.

10 retrieved papers · Can Refute
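The described mechanism resembles gradient-surgery techniques (e.g., PCGrad-style projection). The sketch below is a hedged guess at its shape, not the authors' code: when the two task gradients conflict (negative inner product), each is decomposed into components parallel and orthogonal to the other, and only the opposing parallel part is scaled down. The function name and the `damp` factor are illustrative assumptions.

```python
import numpy as np

def align_gradients(g_p: np.ndarray, g_g: np.ndarray, damp: float = 0.5):
    """Symmetrically soften the conflicting components of two task gradients."""
    def decompose(g, onto):
        denom = float(onto @ onto) + 1e-12
        parallel = (g @ onto) / denom * onto     # component along `onto`
        return parallel, g - parallel            # (parallel, orthogonal)

    if float(g_p @ g_g) >= 0.0:
        # Non-conflicting gradients are left untouched.
        return g_p, g_g

    par_p, orth_p = decompose(g_p, g_g)
    par_g, orth_g = decompose(g_g, g_p)
    # Preserve the orthogonal (non-conflicting) information; scale down
    # the mutually opposing parallel components by the damping factor.
    return orth_p + damp * par_p, orth_g + damp * par_g
```

Unlike a hard projection that removes the conflicting component entirely, damping with 0 < damp < 1 keeps part of each task's original direction, which matches the description of adaptively dampening, rather than discarding, conflicting directions.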

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Each claimed contribution is analyzed in turn:
1. GENREP unified model for joint discriminative and generative learning
2. Distributional knowledge distillation via Monte Carlo approximation
3. Gradient alignment strategy for reconciling dual objectives
