Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
Overview
Overall Novelty Assessment
Lavida-O proposes a unified masked diffusion model for multimodal understanding and generation, supporting image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis. The paper sits in the 'Elastic and Scalable Masked Diffusion Frameworks' leaf, which contains only two papers total (including Lavida-O itself). This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the specific combination of elastic architectures and high-resolution generation via masked diffusion remains an emerging area rather than a crowded subfield.
The taxonomy tree reveals that Lavida-O's leaf is nested under 'Discrete Diffusion for Vision-Language Tasks', which also includes purely discrete diffusion multimodal LLMs and hybrid autoregressive-diffusion models. Neighboring branches explore continuous diffusion for multimodal generation, causal masked models, and diffusion-based instruction tuning. Lavida-O diverges from these by emphasizing elastic mixture-of-transformers and token compression for scalable generation, whereas sibling work in purely discrete or hybrid categories typically employs fixed architectures or sequential autoregressive-then-diffusion training phases.
Across the thirty candidates examined (ten per contribution), the unified-model contribution has two refutable candidates, the Elastic Mixture-of-Transformers architecture one, and the planning-and-self-reflection mechanism two. Each contribution therefore encounters some overlapping prior work within the limited search scope, though most of the examined candidates do not clearly refute the claims. With the lowest refutation rate, the elastic architecture appears slightly more novel than either the unified-model framing or the planning mechanism.
Based on the top-thirty semantic matches and taxonomy structure, Lavida-O occupies a sparsely populated research direction with modest prior-work overlap. The analysis does not cover exhaustive literature beyond the examined candidates, so additional related work may exist outside this scope. The combination of elastic scaling, high-resolution generation, and self-reflection within a single masked diffusion framework appears less well-explored than purely discrete or hybrid autoregressive-diffusion approaches.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce Lavida-O, a single framework that unifies image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis under a masked diffusion modeling objective, achieving state-of-the-art performance across multiple benchmarks.
The authors propose Elastic-MoT, an architecture that uses a smaller generation branch paired with a larger understanding branch and allows modality-specific attention in later layers, enabling efficient training and flexible parameter activation depending on the task.
The authors introduce explicit mechanisms that leverage the model's understanding capabilities to improve generation: planning (generating layouts or identifying edit regions before synthesis) and reflection (evaluating and correcting generated outputs).
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[27] RIV: Recursive Introspection Mask Diffusion Vision Language Model
Contribution Analysis
Detailed comparisons for each claimed contribution
Lavida-O unified masked diffusion model
The authors introduce Lavida-O, a single framework that unifies image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis under a masked diffusion modeling objective, achieving state-of-the-art performance across multiple benchmarks.
[1] Unified multimodal discrete diffusion
[52] MMaDA: Multimodal large diffusion language models
[2] LLaDA-V: Large language diffusion models with visual instruction tuning
[9] Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
[11] MCVD: Masked conditional video diffusion for prediction, generation, and interpolation
[18] UNIMO-G: Unified image generation through multimodal conditional diffusion
[24] LaViDa: A large diffusion language model for multimodal understanding
[31] Smooth diffusion model for multimodal recommendation
[46] MDT-A2G: Exploring masked diffusion transformers for co-speech gesture generation
[51] Diffusion models for multi-modal generative modeling
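To make concrete the sampling procedure that the masked-diffusion objective shared by these candidates implies, the following is a minimal, self-contained sketch: start from a fully masked sequence and, over a fixed number of steps, commit the highest-confidence predictions. All names here (`sample_masked_diffusion`, `toy_predict`, the `MASK` sentinel) are illustrative assumptions, not Lavida-O's actual implementation.

```python
import random

MASK = -1  # sentinel for a masked token (illustrative)

def sample_masked_diffusion(length, predict, steps=4, rng=None):
    """Toy masked-diffusion sampler: start fully masked, then over a
    fixed number of steps commit the highest-confidence predictions."""
    rng = rng or random.Random(0)
    seq = [MASK] * length
    for step in range(steps):
        # Propose a (token, confidence) pair for every still-masked slot.
        proposals = {i: predict(seq, i, rng)
                     for i, tok in enumerate(seq) if tok == MASK}
        if not proposals:
            break
        # Unmask roughly an equal share of positions per remaining step.
        k = max(1, len(proposals) // (steps - step))
        for i in sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]:
            seq[i] = proposals[i][0]
    return seq

def toy_predict(seq, i, rng):
    # Dummy predictor: the "token" is just the position, confidence is random.
    return i, rng.random()

result = sample_masked_diffusion(8, toy_predict)  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

A real model would replace `toy_predict` with a transformer forward pass over the partially unmasked sequence; the key property is that all positions are filled in parallel-refinement steps rather than left-to-right.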
Elastic Mixture-of-Transformers architecture
The authors propose Elastic-MoT, an architecture that uses a smaller generation branch paired with a larger understanding branch and allows modality-specific attention in later layers, enabling efficient training and flexible parameter activation depending on the task.
[71] Growing Visual Generative Capacity for Pre-Trained MLLMs
[63] ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
[64] F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
[65] Mechanisms of symbol processing for in-context learning in transformer networks
[66] MGPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation
[67] A Combined Encoder and Transformer Approach for Coherent and High-Quality Text Generation
[68] InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation
[69] Bridging Your Imagination with Audio-Video Generation via a Unified Director
[70] Motus: A Unified Latent Action World Model
[72] MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning
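The elastic parameter-activation idea behind this contribution can be sketched as follows. The branch sizes and the task-to-branch routing below are hypothetical placeholders for illustration only; they are not Lavida-O's reported configuration.

```python
from dataclasses import dataclass

@dataclass
class Branch:
    name: str
    n_layers: int
    width: int

    def n_params(self):
        # Rough transformer estimate: ~12 * width^2 parameters per layer.
        return self.n_layers * 12 * self.width ** 2

# Hypothetical branch sizes, chosen only for illustration.
UNDERSTANDING = Branch("understanding", n_layers=32, width=4096)
GENERATION = Branch("generation", n_layers=32, width=2048)

def active_params(task):
    """Activate only the branches a given task needs (Elastic-MoT-style)."""
    if task == "understanding":
        branches = [UNDERSTANDING]              # e.g. VQA, grounding
    elif task == "generation":
        branches = [GENERATION]                 # pure text-to-image synthesis
    else:
        branches = [UNDERSTANDING, GENERATION]  # e.g. editing uses both
    return sum(b.n_params() for b in branches)
```

The point of the sketch is the ordering it induces: pure generation activates the fewest parameters, understanding more, and joint tasks such as editing the most, which is what makes the parameter budget "elastic" per task.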
Planning and self-reflection mechanisms
The authors introduce explicit mechanisms that leverage the model's understanding capabilities to improve generation: planning (generating layouts or identifying edit regions before synthesis) and reflection (evaluating and correcting generated outputs).
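The control flow of this plan-then-reflect contribution can be sketched as a generic loop: draft a layout, render it, then repeatedly self-check and correct. The function names and the numeric toy "image" below are hypothetical stand-ins, not the authors' implementation.

```python
def generate_with_plan_and_reflection(prompt, plan, synthesize,
                                      evaluate, refine, max_rounds=3):
    """Plan -> synthesize -> reflect loop: draft a layout, render it,
    then repeatedly self-check and correct until the output passes."""
    layout = plan(prompt)                        # planning: layout / edit regions
    image = synthesize(prompt, layout)
    for _ in range(max_rounds):
        ok, feedback = evaluate(prompt, image)   # reflection: self-check
        if ok:
            break
        image = refine(image, feedback)          # targeted correction
    return image

# Toy stand-ins: the "image" is a number that refine() nudges toward a target.
TARGET = 5
result = generate_with_plan_and_reflection(
    "a cat",
    plan=lambda p: "layout",
    synthesize=lambda p, layout: 0,
    evaluate=lambda p, img: (img == TARGET, TARGET - img),
    refine=lambda img, fb: img + max(-2, min(2, fb)),
    max_rounds=5,
)  # → 5
```

In a unified model, `plan`, `evaluate`, and `refine` would all be served by the understanding branch and `synthesize` by the generation branch, which is why the paper frames reflection as understanding improving generation.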