Abstract:

Although unified MLLMs aim to unify generation and understanding, they are widely believed to exhibit an internal gap, with understanding outperforming generation. Through large‑scale evaluation across multiple MLLMs and tasks, we confirm that non‑unification is widespread in MLLMs, and demonstrate that it stems from weak generation rather than misunderstanding. This finding motivates a simple yet effective internal gap-based self-improvement framework, which mitigates the internal gap by leveraging stronger understanding to guide weaker generation, without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training. Specifically, as generation improves, understanding becomes more effective at detecting false positives that were previously misclassified as prompt‑aligned. To explain this effect, we extend learning dynamics theory to the MLLM setting, showing that the shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, thereby driving co-improvement. This interplay between generation and understanding further motivates a curriculum learning approach for stronger self‑improvement: progressively enhanced understanding and generation revisit samples underutilized by pre‑trained MLLMs, dynamically expanding post‑training data and leading to improved performance and unification.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a self-improvement framework that addresses the internal gap between understanding and generation in unified multimodal large language models. It resides in the 'Unified Training Strategies and Optimization' leaf, which contains five papers total, including the original work. This leaf sits within the broader 'Unified Architecture Design and Training Paradigms' branch, indicating a moderately populated research direction focused on training methodologies rather than architectural innovations. The taxonomy shows this is an active but not overcrowded area, with sibling works exploring joint fine-tuning and multi-stage pretraining strategies.

The taxonomy reveals several neighboring research directions that contextualize this work. Adjacent leaves include 'Autoregressive Unified Frameworks' (five papers on next-token prediction architectures) and 'Modality Encoding and Alignment Strategies' (five papers on embedding space alignment). The 'Specialized Capability Enhancement' branch addresses orthogonal concerns like grounding and hallucination mitigation, while 'Modality Expansion and Any-to-Any Systems' extends beyond vision-language pairs. The scope note for the parent branch explicitly excludes task-specific optimization, positioning this work as a general training methodology applicable across unified architectures rather than a domain-specific solution.

Among nineteen candidates examined through limited semantic search, the analysis identified one refutable pair for the 'Internal gap-based self-improvement framework' contribution (ten candidates examined). The 'Non-unification score metric' contribution showed no clear refutation across the seven candidates examined, suggesting potential novelty in the measurement approach. The 'Learning dynamics theory extension' contribution, examined against only two candidates, also showed no refutation. The modest search scope means these findings reflect top-K semantic matches rather than exhaustive coverage, and the single refutable pair indicates that some prior work exists on self-improvement strategies for unified models.

Based on the limited literature search covering nineteen candidates, the work appears to occupy a moderately explored space within unified training strategies. The taxonomy structure suggests the field has established multiple complementary approaches to unification, and this contribution adds a self-improvement perspective to existing joint training methodologies. The analysis does not cover the full breadth of multimodal training literature, particularly works outside the top semantic matches or those published concurrently.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Paper: 1

Research Landscape Overview

Core task: unifying generation and understanding in multimodal large language models. The field has evolved around several complementary directions. Unified Architecture Design and Training Paradigms explores how to build single models that seamlessly handle both comprehension and synthesis tasks, often through novel tokenization schemes or shared encoder-decoder structures (e.g., Janus[7], Unified Generative Discriminative[3]). Specialized Capability Enhancement focuses on refining particular skills such as grounding, reasoning, or fine-grained visual understanding. Domain-Specific Adaptation targets applications in documents, UI navigation, remote sensing, and other specialized contexts (e.g., Docllm[16], Ferret-UI[17]). Modality Expansion and Any-to-Any Systems push toward handling diverse input-output combinations beyond vision and language, including audio and video (e.g., NExT-GPT[18], Anygpt[26]). Evaluation Frameworks provide benchmarks to measure unified capabilities (e.g., Seed-bench[1]), while Surveys and Comparative Analyses synthesize progress across these branches (e.g., Finetuning Multimodal Survey[4], Unified Multimodal Survey[32]).

Within Unified Architecture Design, a particularly active line of work examines training strategies that balance discriminative and generative objectives without architectural bifurcation. Internal Gap Self-Improvement[0] sits squarely in this cluster, emphasizing optimization techniques that iteratively refine both understanding and generation within a single framework. Nearby efforts like Unified Generative Discriminative[3] and UniGen[37] similarly explore joint training regimes, though they may differ in tokenization choices or loss formulations. Another contrasting thread involves decoupled pathways (e.g., Janus-pro[6], Metamorph[14]) that maintain separate streams for each modality or task before late fusion.
The central tension across these approaches is whether true unification requires fully shared representations or whether modular designs offer better scalability and task-specific tuning. Internal Gap Self-Improvement[0] contributes to the former perspective by proposing self-improvement mechanisms that tighten the coupling between comprehension and synthesis, distinguishing it from works that rely on more explicit architectural separation.

Claimed Contributions

Non-unification score metric for measuring internal gap in MLLMs

The authors propose a self-consistency metric called the non-unification score that quantifies the generation-understanding gap in unified MLLMs by measuring the proportion of cases where the understanding branch judges generated images as prompt-misaligned. Unlike prior metrics relying on external evaluators, this metric directly measures internal consistency between the two branches.

7 retrieved papers
Internal gap-based self-improvement framework

The authors introduce a self-improvement framework that mitigates the internal gap in MLLMs by using the stronger understanding branch to score and guide the weaker generation branch, without relying on external signals. This framework applies standard post-training strategies like SFT and DPO using preference data constructed from internal understanding judgments.

10 retrieved papers
Can Refute
Learning dynamics theory extension explaining co-improvement effect

The authors extend learning dynamics theory to the multimodal MLLM setting to explain the observed co-improvement phenomenon, where generation-targeted self-improvement also enhances understanding. Their theoretical analysis reveals that a shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, driving the co-improvement effect.

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Non-unification score metric for measuring internal gap in MLLMs

The authors propose a self-consistency metric called the non-unification score that quantifies the generation-understanding gap in unified MLLMs by measuring the proportion of cases where the understanding branch judges generated images as prompt-misaligned. Unlike prior metrics relying on external evaluators, this metric directly measures internal consistency between the two branches.
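The metric as described reduces to a simple proportion. The sketch below is a minimal illustration under stated assumptions: `generate` and `judge_aligned` are hypothetical placeholders for the model's generation and understanding branches, not the paper's actual interfaces.

```python
def non_unification_score(prompts, generate, judge_aligned):
    """Fraction of self-generated images that the model's own understanding
    branch judges as prompt-misaligned. Higher score = larger internal gap;
    0.0 would indicate a fully self-consistent (unified) model."""
    misaligned = sum(
        1 for prompt in prompts
        if not judge_aligned(prompt, generate(prompt))
    )
    return misaligned / len(prompts)
```

Because both branches belong to the same model, the score requires no external evaluator; it is purely a measure of internal consistency.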

Contribution

Internal gap-based self-improvement framework

The authors introduce a self-improvement framework that mitigates the internal gap in MLLMs by using the stronger understanding branch to score and guide the weaker generation branch, without relying on external signals. This framework applies standard post-training strategies like SFT and DPO using preference data constructed from internal understanding judgments.
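One way to realize this data-construction step for DPO is sketched below, assuming the understanding branch exposes an alignment score. The names `generate_k` and `score_alignment` are illustrative stand-ins, not the paper's implementation; for each prompt, the best- and worst-scored candidates form a preference pair.

```python
def build_dpo_pairs(prompts, generate_k, score_alignment, k=4):
    """Construct DPO preference pairs from internal understanding judgments:
    sample k candidate images per prompt, score each with the understanding
    branch, and keep (best, worst) as (chosen, rejected)."""
    pairs = []
    for prompt in prompts:
        candidates = generate_k(prompt, k)                      # generation branch
        ranked = sorted(candidates,
                        key=lambda img: score_alignment(prompt, img))
        chosen, rejected = ranked[-1], ranked[0]
        # Skip ties: no preference signal if all candidates score equally.
        if score_alignment(prompt, chosen) > score_alignment(prompt, rejected):
            pairs.append({"prompt": prompt,
                          "chosen": chosen,
                          "rejected": rejected})
    return pairs
```

An analogous SFT variant would simply keep the top-scored candidate per prompt as a supervised target instead of forming pairs.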

Contribution

Learning dynamics theory extension explaining co-improvement effect

The authors extend learning dynamic theory to the multimodal MLLM setting to explain the observed co-improvement phenomenon, where generation-targeted self-improvement also enhances understanding. Their theoretical analysis reveals that a shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, driving the co-improvement effect.
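The shared-kernel argument can be sketched with standard first-order learning dynamics; the notation below is a generic reconstruction (symbols $f_u$, $f_g$, $\Theta_{ug}$ are illustrative and need not match the paper's). For shared parameters $\theta$, a gradient step of size $\eta$ on the generation loss $\mathcal{L}_g$ perturbs the understanding output $f_u$ as:

```latex
\Delta f_u(x)
\;\approx\; -\eta \, \nabla_\theta f_u(x)^{\top} \nabla_\theta \mathcal{L}_g
\;=\; -\eta \sum_{x'} \Theta_{ug}(x, x') \,
      \frac{\partial \mathcal{L}_g}{\partial f_g(x')},
\qquad
\Theta_{ug}(x, x') \;=\; \nabla_\theta f_u(x)^{\top} \nabla_\theta f_g(x')
```

Here $\Theta_{ug}$ is the cross-task empirical neural tangent kernel. When the two heads share most of their parameters and $\Theta_{ug}$ is positively aligned, a step that reduces the generation loss also moves the understanding outputs in a correlated direction, which is the mechanism behind the claimed co-improvement.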