Reconstruction Alignment Improves Unified Multimodal Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Unified Multimodal Models; Image Generation; Image Editing; Visual Understanding
Abstract:

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image–text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense “text prompts,” providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.27). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
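To make the objective concrete, the following is a minimal, runnable sketch of the reconstruction-alignment idea described above: condition a generator on the image's own understanding embedding and minimize a self-supervised reconstruction loss. A toy frozen linear map stands in for the understanding encoder and a trainable linear map for the UMM generator; all dimensions, names, and the plain-SGD update are illustrative assumptions, not the authors' implementation.

```python
import random

random.seed(0)

D_IMG, D_EMB = 8, 4          # toy "image" and embedding dimensions (illustrative)
LR, STEPS = 0.05, 200

# Frozen "understanding encoder": a fixed random linear map (stands in for the
# UMM's semantic vision encoder, which RecA conditions on but does not update here).
W_enc = [[random.gauss(0, 0.5) for _ in range(D_IMG)] for _ in range(D_EMB)]

# Trainable "generator": maps the understanding embedding back to pixels.
W_dec = [[random.gauss(0, 0.1) for _ in range(D_EMB)] for _ in range(D_IMG)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def reca_loss(image):
    """Self-supervised reconstruction loss: the embedding of the input image
    acts as a dense 'text prompt' for the generator."""
    emb = matvec(W_enc, image)      # understanding embedding of the input
    recon = matvec(W_dec, emb)      # generator output conditioned on it
    loss = sum((r - x) ** 2 for r, x in zip(recon, image)) / len(image)
    return loss, emb, recon

def sgd_step(image):
    loss, emb, recon = reca_loss(image)
    # Manual MSE gradient w.r.t. W_dec only (the encoder stays frozen).
    for i in range(D_IMG):
        g = 2 * (recon[i] - image[i]) / D_IMG
        for j in range(D_EMB):
            W_dec[i][j] -= LR * g * emb[j]
    return loss

images = [[random.gauss(0, 1) for _ in range(D_IMG)] for _ in range(16)]
first = sum(sgd_step(img) for img in images)   # total loss on the first pass
for _ in range(STEPS):
    last = sum(sgd_step(img) for img in images)
# After post-training, reconstruction loss drops: generation has realigned
# with the (frozen) understanding embeddings.
```

In a real UMM the generator would be an autoregressive, masked-autoregressive, or diffusion decoder and the loss would match that decoder's native training objective; the structure of the update, however, is the same: reconstruct the input from its own understanding embedding.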

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. Because the current automated pipeline does not reliably align or distinguish these cases, human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Reconstruction Alignment (RecA), a post-training method that uses visual understanding embeddings as dense supervision to realign understanding and generation in unified multimodal models. Within the taxonomy, it resides in the 'Post-Training Alignment and Refinement' leaf, which contains only two papers total. This represents a relatively sparse research direction compared to more crowded areas like 'Autoregressive Unified Models' (six papers) or 'Instruction Tuning and Task Adaptation' (four papers), suggesting the specific focus on post-training reconstruction-based alignment is less explored.

The taxonomy reveals that RecA's parent category 'Training Strategies and Alignment Methods' encompasses four distinct approaches: post-training refinement, instruction tuning, multi-stage training, and reinforcement learning. Neighboring leaves like 'Instruction Tuning and Task Adaptation' focus on task-specific adaptation through instruction formats, while 'Multi-Stage and Progressive Training' emphasizes phased learning recipes. RecA diverges by targeting post-training realignment through self-supervised reconstruction rather than instruction-following or progressive curricula. The taxonomy's scope note explicitly distinguishes post-training methods from pre-training and instruction tuning strategies, positioning RecA as addressing a later-stage alignment challenge.

Across three contributions examined, the literature search analyzed thirty candidate papers total, finding zero clearly refutable instances for any contribution. Specifically, the 'Reconstruction Alignment method' examined ten candidates with none providing overlapping prior work; 'Broad applicability across architectures' similarly found no refutations among ten candidates; and 'Efficient post-training achieving SOTA' showed the same pattern. Given the limited search scope of thirty semantically similar papers rather than an exhaustive survey, these statistics suggest that among closely related work examined, no direct precedents for reconstruction-based post-training alignment were identified, though the search scale leaves room for unexamined literature.

Based on the limited thirty-candidate search and sparse taxonomy positioning, RecA appears to occupy a relatively underexplored niche within post-training alignment methods. The absence of refutable prior work among examined candidates, combined with only one sibling paper in its taxonomy leaf, suggests novelty within the scope analyzed. However, the analysis explicitly covers top-K semantic matches rather than comprehensive field coverage, meaning potential related work in adjacent areas like multi-stage training or continuous tokenization may exist outside the examined set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: aligning visual understanding and generation in unified multimodal models. The field has organized itself around several complementary dimensions. Visual Representation and Tokenization Strategies explore how to encode images and videos into discrete tokens that language models can process, with approaches ranging from vector quantization to learned codebooks. Unified Architectural Paradigms investigate end-to-end designs that handle both perception and generation within a single framework, often building on transformer or diffusion backbones. Training Strategies and Alignment Methods address the challenge of teaching models to transition seamlessly between interpreting and creating visual content, including pre-training recipes, instruction tuning, and post-training refinement techniques. Domain-Specific and Specialized Applications adapt these unified models to particular use cases such as medical imaging, video synthesis, or fashion design. Evaluation, Analysis, and Benchmarking provide the metrics and datasets needed to assess cross-modal performance, while Modality Integration and Cross-Modal Learning focus on bridging text, vision, and other signals. Representative works like Janus[12] and Show-o[23] illustrate how different architectural choices lead to distinct trade-offs between generation quality and understanding accuracy.

A particularly active line of work centers on post-training alignment and refinement, where models pre-trained on large-scale data undergo additional tuning to better harmonize their dual capabilities. Reconstruction Alignment[0] exemplifies this direction by using reconstruction objectives to tighten the coupling between visual encoders and decoders, ensuring that generated outputs remain faithful to understood inputs. This contrasts with approaches like ILLUME+[37], which emphasize iterative refinement and feedback loops to improve generation fidelity.

Meanwhile, works such as MetaMorph[1] and DualToken[3] explore alternative tokenization schemes that may reduce the need for extensive post-training by designing representations that are inherently more aligned. The central tension across these branches involves balancing the expressiveness of visual tokens against the computational cost of alignment, and deciding whether to unify representations early in the architecture or reconcile them through later-stage training. Reconstruction Alignment[0] sits squarely within the post-training refinement cluster, sharing the goal of tightening understanding-generation coherence but differing from neighbors like ILLUME+[37] in its emphasis on direct reconstruction losses rather than multi-stage feedback.

Claimed Contributions

Reconstruction Alignment (RecA) post-training method

RecA is a self-supervised post-training approach that conditions unified multimodal models on their own visual understanding embeddings to reconstruct input images, thereby realigning understanding and generation capabilities without requiring text captions.

10 retrieved papers
Broad applicability across diverse UMM architectures

The method demonstrates generality by delivering consistent performance improvements across multiple unified multimodal model families with different generation mechanisms, including discrete token prediction, masked autoregressive, and continuous diffusion approaches.

10 retrieved papers
Efficient post-training strategy achieving SOTA performance

RecA achieves state-of-the-art results on image generation and editing benchmarks with minimal computational cost, enabling a 1.5B-parameter model to surpass much larger open-source models without requiring GPT-4o distillation data or reinforcement learning.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
