Abstract:

Although unified MLLMs aim to unify generation and understanding, they are widely believed to exhibit an internal gap, with understanding outperforming generation. Through large‑scale evaluation across multiple MLLMs and tasks, we confirm that non‑unification is widespread in MLLMs, and demonstrate that it stems from weak generation rather than misunderstanding. This finding motivates a simple yet effective internal gap-based self-improvement framework, which mitigates the internal gap by leveraging stronger understanding to guide weaker generation, without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such self-improvement, a phenomenon well known in pre-training but underexplored in post-training. Specifically, as generation improves, understanding becomes more effective at detecting false positives that were previously misclassified as prompt‑aligned. To explain this effect, we extend learning dynamics theory to the MLLM setting, showing that the shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, thereby driving co-improvement. This interplay between generation and understanding further motivates a curriculum learning approach for stronger self‑improvement: progressively enhanced understanding and generation revisit samples underutilized by pre‑trained MLLMs, dynamically expanding post‑training data and leading to improved performance and unification.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a self-improvement framework that addresses the internal gap between understanding and generation in unified multimodal large language models. It resides in the 'Unified Training Strategies and Optimization' leaf, which contains five papers total, including the original work. This leaf sits within the broader 'Unified Architecture Design and Training Paradigms' branch, indicating a moderately populated research direction focused on training methodologies rather than architectural innovations. The taxonomy shows this is an active but not overcrowded area, with sibling works exploring joint fine-tuning and multi-stage pretraining strategies.

The taxonomy reveals several neighboring research directions that contextualize this work. Adjacent leaves include 'Autoregressive Unified Frameworks' (five papers on next-token prediction architectures) and 'Modality Encoding and Alignment Strategies' (five papers on embedding space alignment). The 'Specialized Capability Enhancement' branch addresses orthogonal concerns like grounding and hallucination mitigation, while 'Modality Expansion and Any-to-Any Systems' extends beyond vision-language pairs. The scope note for the parent branch explicitly excludes task-specific optimization, positioning this work as a general training methodology applicable across unified architectures rather than a domain-specific solution.

Among nineteen candidates examined through limited semantic search, the analysis identified one refutable pair for the 'Internal gap-based self-improvement framework' contribution (ten candidates examined). The 'Non-unification score metric' contribution showed no clear refutation across the seven candidates examined, suggesting potential novelty in the measurement approach. The 'Learning dynamics theory extension' contribution, examined against only two candidates, also showed no refutation. The modest search scope means these findings reflect top-K semantic matches rather than exhaustive coverage, and the single refutable pair indicates that some prior work exists on self-improvement strategies for unified models.

Based on the limited literature search covering nineteen candidates, the work appears to occupy a moderately explored space within unified training strategies. The taxonomy structure suggests the field has established multiple complementary approaches to unification, and this contribution adds a self-improvement perspective to existing joint training methodologies. The analysis does not cover the full breadth of multimodal training literature, particularly works outside the top semantic matches or those published concurrently.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Paper: 1

Research Landscape Overview

Core task: unifying generation and understanding in multimodal large language models. The field has evolved around several complementary directions. Unified Architecture Design and Training Paradigms explores how to build single models that seamlessly handle both comprehension and synthesis tasks, often through novel tokenization schemes or shared encoder-decoder structures (e.g., Janus[7], Unified Generative Discriminative[3]). Specialized Capability Enhancement focuses on refining particular skills such as grounding, reasoning, or fine-grained visual understanding. Domain-Specific Adaptation targets applications in documents, UI navigation, remote sensing, and other specialized contexts (e.g., Docllm[16], Ferret-UI[17]). Modality Expansion and Any-to-Any Systems push toward handling diverse input-output combinations beyond vision and language, including audio and video (e.g., NExT-GPT[18], Anygpt[26]). Evaluation Frameworks provide benchmarks to measure unified capabilities (e.g., Seed-bench[1]), while Surveys and Comparative Analyses synthesize progress across these branches (e.g., Finetuning Multimodal Survey[4], Unified Multimodal Survey[32]).

Within Unified Architecture Design, a particularly active line of work examines training strategies that balance discriminative and generative objectives without architectural bifurcation. Internal Gap Self-Improvement[0] sits squarely in this cluster, emphasizing optimization techniques that iteratively refine both understanding and generation within a single framework. Nearby efforts like Unified Generative Discriminative[3] and UniGen[37] similarly explore joint training regimes, though they may differ in tokenization choices or loss formulations. Another contrasting thread involves decoupled pathways (e.g., Janus-pro[6], Metamorph[14]) that maintain separate streams for each modality or task before late fusion.
The central tension across these approaches is whether true unification requires fully shared representations or whether modular designs offer better scalability and task-specific tuning. Internal Gap Self-Improvement[0] contributes to the former perspective by proposing self-improvement mechanisms that tighten the coupling between comprehension and synthesis, distinguishing it from works that rely on more explicit architectural separation.

Claimed Contributions

Non-unification score metric for measuring internal gap in MLLMs

The authors propose a self-consistency metric called the non-unification score that quantifies the generation-understanding gap in unified MLLMs by measuring the proportion of cases where the understanding branch judges generated images as prompt-misaligned. Unlike prior metrics relying on external evaluators, this metric directly measures internal consistency between the two branches.

7 retrieved papers
Internal gap-based self-improvement framework

The authors introduce a self-improvement framework that mitigates the internal gap in MLLMs by using the stronger understanding branch to score and guide the weaker generation branch, without relying on external signals. This framework applies standard post-training strategies like SFT and DPO using preference data constructed from internal understanding judgments.

10 retrieved papers
Can Refute
Learning dynamics theory extension explaining co-improvement effect

The authors extend learning dynamics theory to the multimodal MLLM setting to explain the observed co-improvement phenomenon, where generation-targeted self-improvement also enhances understanding. Their theoretical analysis reveals that a shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, driving the co-improvement effect.

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Non-unification score metric for measuring internal gap in MLLMs

The authors propose a self-consistency metric called the non-unification score that quantifies the generation-understanding gap in unified MLLMs by measuring the proportion of cases where the understanding branch judges generated images as prompt-misaligned. Unlike prior metrics relying on external evaluators, this metric directly measures internal consistency between the two branches.
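The metric as described reduces to a simple proportion. The sketch below is a minimal illustration under stated assumptions: `generate` and `judge_aligned` are hypothetical placeholders for the model's generation and understanding branches, not the paper's actual interfaces.

```python
def non_unification_score(prompts, generate, judge_aligned):
    """Fraction of self-generated images that the model's own understanding
    branch judges as prompt-misaligned. Higher score = larger internal gap;
    0.0 would indicate a fully self-consistent (unified) model."""
    misaligned = sum(
        1 for prompt in prompts
        if not judge_aligned(prompt, generate(prompt))
    )
    return misaligned / len(prompts)
```

Because both branches belong to the same model, the score requires no external evaluator; it is purely a measure of internal consistency.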

Contribution

Internal gap-based self-improvement framework

The authors introduce a self-improvement framework that mitigates the internal gap in MLLMs by using the stronger understanding branch to score and guide the weaker generation branch, without relying on external signals. This framework applies standard post-training strategies like SFT and DPO using preference data constructed from internal understanding judgments.
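One way to realize this data-construction step for DPO is sketched below, assuming the understanding branch exposes an alignment score. The names `generate_k` and `score_alignment` are illustrative stand-ins, not the paper's implementation; for each prompt, the best- and worst-scored candidates form a preference pair.

```python
def build_dpo_pairs(prompts, generate_k, score_alignment, k=4):
    """Construct DPO preference pairs from internal understanding judgments:
    sample k candidate images per prompt, score each with the understanding
    branch, and keep (best, worst) as (chosen, rejected)."""
    pairs = []
    for prompt in prompts:
        candidates = generate_k(prompt, k)                      # generation branch
        ranked = sorted(candidates,
                        key=lambda img: score_alignment(prompt, img))
        chosen, rejected = ranked[-1], ranked[0]
        # Skip ties: no preference signal if all candidates score equally.
        if score_alignment(prompt, chosen) > score_alignment(prompt, rejected):
            pairs.append({"prompt": prompt,
                          "chosen": chosen,
                          "rejected": rejected})
    return pairs
```

An analogous SFT variant would simply keep the top-scored candidate per prompt as a supervised target instead of forming pairs.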

Contribution

Learning dynamics theory extension explaining co-improvement effect

The authors extend learning dynamic theory to the multimodal MLLM setting to explain the observed co-improvement phenomenon, where generation-targeted self-improvement also enhances understanding. Their theoretical analysis reveals that a shared empirical neural tangent kernel between generation and understanding encourages aligned learning dynamics, driving the co-improvement effect.
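The shared-kernel argument can be sketched with standard first-order learning dynamics; the notation below is a generic reconstruction (symbols $f_u$, $f_g$, $\Theta_{ug}$ are illustrative and need not match the paper's). For shared parameters $\theta$, a gradient step of size $\eta$ on the generation loss $\mathcal{L}_g$ perturbs the understanding output $f_u$ as:

```latex
\Delta f_u(x)
\;\approx\; -\eta \, \nabla_\theta f_u(x)^{\top} \nabla_\theta \mathcal{L}_g
\;=\; -\eta \sum_{x'} \Theta_{ug}(x, x') \,
      \frac{\partial \mathcal{L}_g}{\partial f_g(x')},
\qquad
\Theta_{ug}(x, x') \;=\; \nabla_\theta f_u(x)^{\top} \nabla_\theta f_g(x')
```

Here $\Theta_{ug}$ is the cross-task empirical neural tangent kernel. When the two heads share most of their parameters and $\Theta_{ug}$ is positively aligned, a step that reduces the generation loss also moves the understanding outputs in a correlated direction, which is the mechanism behind the claimed co-improvement.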