Reconstruction Alignment Improves Unified Multimodal Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Unified Multimodal Models; Image Generation; Image Editing; Visual Understanding
Abstract:

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image–text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense “text prompts,” providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.27). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
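To make the objective concrete, the following is a minimal, runnable sketch of the reconstruction-alignment idea described above: condition a generator on the image's own understanding embedding and minimize a self-supervised reconstruction loss. A toy frozen linear map stands in for the understanding encoder and a trainable linear map for the UMM generator; all dimensions, names, and the plain-SGD update are illustrative assumptions, not the authors' implementation.

```python
import random

random.seed(0)

D_IMG, D_EMB = 8, 4          # toy "image" and embedding dimensions (illustrative)
LR, STEPS = 0.05, 200

# Frozen "understanding encoder": a fixed random linear map (stands in for the
# UMM's semantic vision encoder, which RecA conditions on but does not update here).
W_enc = [[random.gauss(0, 0.5) for _ in range(D_IMG)] for _ in range(D_EMB)]

# Trainable "generator": maps the understanding embedding back to pixels.
W_dec = [[random.gauss(0, 0.1) for _ in range(D_EMB)] for _ in range(D_IMG)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def reca_loss(image):
    """Self-supervised reconstruction loss: the embedding of the input image
    acts as a dense 'text prompt' for the generator."""
    emb = matvec(W_enc, image)      # understanding embedding of the input
    recon = matvec(W_dec, emb)      # generator output conditioned on it
    loss = sum((r - x) ** 2 for r, x in zip(recon, image)) / len(image)
    return loss, emb, recon

def sgd_step(image):
    loss, emb, recon = reca_loss(image)
    # Manual MSE gradient w.r.t. W_dec only (the encoder stays frozen).
    for i in range(D_IMG):
        g = 2 * (recon[i] - image[i]) / D_IMG
        for j in range(D_EMB):
            W_dec[i][j] -= LR * g * emb[j]
    return loss

images = [[random.gauss(0, 1) for _ in range(D_IMG)] for _ in range(16)]
first = sum(sgd_step(img) for img in images)   # total loss on the first pass
for _ in range(STEPS):
    last = sum(sgd_step(img) for img in images)
# After post-training, reconstruction loss drops: generation has realigned
# with the (frozen) understanding embeddings.
```

In a real UMM the generator would be an autoregressive, masked-autoregressive, or diffusion decoder and the loss would match that decoder's native training objective; the structure of the update, however, is the same: reconstruct the input from its own understanding embedding.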

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. Because the current automated pipeline does not reliably align or distinguish these cases, human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Reconstruction Alignment (RecA), a post-training method that uses visual understanding embeddings as dense supervision to realign understanding and generation in unified multimodal models. Within the taxonomy, it resides in the 'Post-Training Alignment and Refinement' leaf, which contains only two papers total. This represents a relatively sparse research direction compared to more crowded areas like 'Autoregressive Unified Models' (six papers) or 'Instruction Tuning and Task Adaptation' (four papers), suggesting the specific focus on post-training reconstruction-based alignment is less explored.

The taxonomy reveals that RecA's parent category 'Training Strategies and Alignment Methods' encompasses four distinct approaches: post-training refinement, instruction tuning, multi-stage training, and reinforcement learning. Neighboring leaves like 'Instruction Tuning and Task Adaptation' focus on task-specific adaptation through instruction formats, while 'Multi-Stage and Progressive Training' emphasizes phased learning recipes. RecA diverges by targeting post-training realignment through self-supervised reconstruction rather than instruction-following or progressive curricula. The taxonomy's scope note explicitly distinguishes post-training methods from pre-training and instruction tuning strategies, positioning RecA as addressing a later-stage alignment challenge.

Across three contributions examined, the literature search analyzed thirty candidate papers total, finding zero clearly refutable instances for any contribution. Specifically, the 'Reconstruction Alignment method' examined ten candidates with none providing overlapping prior work; 'Broad applicability across architectures' similarly found no refutations among ten candidates; and 'Efficient post-training achieving SOTA' showed the same pattern. Given the limited search scope of thirty semantically similar papers rather than an exhaustive survey, these statistics suggest that among closely related work examined, no direct precedents for reconstruction-based post-training alignment were identified, though the search scale leaves room for unexamined literature.

Based on the limited thirty-candidate search and sparse taxonomy positioning, RecA appears to occupy a relatively underexplored niche within post-training alignment methods. The absence of refutable prior work among examined candidates, combined with only one sibling paper in its taxonomy leaf, suggests novelty within the scope analyzed. However, the analysis explicitly covers top-K semantic matches rather than comprehensive field coverage, meaning potential related work in adjacent areas like multi-stage training or continuous tokenization may exist outside the examined set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: aligning visual understanding and generation in unified multimodal models. The field has organized itself around several complementary dimensions. Visual Representation and Tokenization Strategies explore how to encode images and videos into discrete tokens that language models can process, with approaches ranging from vector quantization to learned codebooks. Unified Architectural Paradigms investigate end-to-end designs that handle both perception and generation within a single framework, often building on transformer or diffusion backbones. Training Strategies and Alignment Methods address the challenge of teaching models to transition seamlessly between interpreting and creating visual content, including pre-training recipes, instruction tuning, and post-training refinement techniques. Domain-Specific and Specialized Applications adapt these unified models to particular use cases such as medical imaging, video synthesis, or fashion design. Evaluation, Analysis, and Benchmarking provide the metrics and datasets needed to assess cross-modal performance, while Modality Integration and Cross-Modal Learning focus on bridging text, vision, and other signals. Representative works like Janus[12] and Show-o[23] illustrate how different architectural choices lead to distinct trade-offs between generation quality and understanding accuracy.

A particularly active line of work centers on post-training alignment and refinement, where models pre-trained on large-scale data undergo additional tuning to better harmonize their dual capabilities. Reconstruction Alignment[0] exemplifies this direction by using reconstruction objectives to tighten the coupling between visual encoders and decoders, ensuring that generated outputs remain faithful to understood inputs. This contrasts with approaches like ILLUME+[37], which emphasize iterative refinement and feedback loops to improve generation fidelity.

Meanwhile, works such as MetaMorph[1] and DualToken[3] explore alternative tokenization schemes that may reduce the need for extensive post-training by designing representations that are inherently more aligned. The central tension across these branches involves balancing the expressiveness of visual tokens against the computational cost of alignment, and deciding whether to unify representations early in the architecture or reconcile them through later-stage training. Reconstruction Alignment[0] sits squarely within the post-training refinement cluster, sharing the goal of tightening understanding-generation coherence but differing from neighbors like ILLUME+[37] in its emphasis on direct reconstruction losses rather than multi-stage feedback.

Claimed Contributions

Reconstruction Alignment (RecA) post-training method

RecA is a self-supervised post-training approach that conditions unified multimodal models on their own visual understanding embeddings to reconstruct input images, thereby realigning understanding and generation capabilities without requiring text captions.

10 retrieved papers
Broad applicability across diverse UMM architectures

The method demonstrates generality by delivering consistent performance improvements across multiple unified multimodal model families with different generation mechanisms, including discrete token prediction, masked autoregressive, and continuous diffusion approaches.

10 retrieved papers
Efficient post-training strategy achieving SOTA performance

RecA achieves state-of-the-art results on image generation and editing benchmarks with minimal computational cost, enabling a 1.5B-parameter model to surpass much larger open-source models without requiring GPT-4o distillation data or reinforcement learning.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
