Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Masked Diffusion Model, Unified Multimodal Model
Abstract:

We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit, which support only simple image-level understanding tasks and low-resolution image generation, Lavida-O offers a single framework for image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning, and stratified sampling for efficient, high-quality generation. Lavida-O further incorporates planning and iterative self-reflection in image generation and editing tasks, seamlessly boosting generation quality with its understanding capabilities. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks, including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev while offering a considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Lavida-O proposes a unified masked diffusion model for multimodal understanding and generation, supporting image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis. The paper sits in the 'Elastic and Scalable Masked Diffusion Frameworks' leaf, which contains only two papers total (including Lavida-O itself). This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the specific combination of elastic architectures and high-resolution generation via masked diffusion remains an emerging area rather than a crowded subfield.

The taxonomy tree reveals that Lavida-O's leaf is nested under 'Discrete Diffusion for Vision-Language Tasks', which also includes purely discrete diffusion multimodal LLMs and hybrid autoregressive-diffusion models. Neighboring branches explore continuous diffusion for multimodal generation, causal masked models, and diffusion-based instruction tuning. Lavida-O diverges from these by emphasizing elastic mixture-of-transformers and token compression for scalable generation, whereas sibling work in purely discrete or hybrid categories typically employs fixed architectures or sequential autoregressive-then-diffusion training phases.

Across the thirty candidates examined (ten per contribution), the unified-model contribution has two refutable candidates, the Elastic Mixture-of-Transformers architecture one, and the planning-and-self-reflection mechanism two. Each contribution therefore encounters some overlapping prior work within the limited search scope, though most examined candidates do not clearly refute the claims. Given their lower refutation rates, the elastic architecture and planning mechanisms appear slightly more novel than the unified-model framing.

Based on the top-thirty semantic matches and taxonomy structure, Lavida-O occupies a sparsely populated research direction with modest prior-work overlap. The analysis does not cover exhaustive literature beyond the examined candidates, so additional related work may exist outside this scope. The combination of elastic scaling, high-resolution generation, and self-reflection within a single masked diffusion framework appears less well-explored than purely discrete or hybrid autoregressive-diffusion approaches.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 5

Research Landscape Overview

Core task: unified multimodal understanding and generation with masked diffusion models. The field has evolved around several complementary branches that together address the challenge of building systems capable of both perceiving and generating content across vision, language, and other modalities. Unified Multimodal Diffusion Architectures explore end-to-end frameworks that handle multiple modalities within a single model, often leveraging discrete diffusion or masked generative processes to enable flexible input-output configurations; representative works like Unified Multimodal Discrete[1] and Llada-v[2] illustrate this trend.

Domain-Specific Multimodal Diffusion Applications adapt these architectures to specialized settings such as medical imaging, robotics, or video synthesis, while Masked Modeling for Multimodal Representation Learning focuses on self-supervised pretraining strategies that learn joint embeddings by reconstructing masked tokens across modalities. Conditional Multimodal Generation with Diffusion emphasizes controllable synthesis guided by cross-modal signals, and Specialized Multimodal Diffusion Techniques develop novel sampling, training, or architectural innovations to improve scalability and quality.

Within the Unified Multimodal Diffusion Architectures branch, a particularly active line of work centers on discrete diffusion for vision-language tasks, where models operate in tokenized latent spaces to unify understanding and generation. Lavida-O[0] exemplifies an elastic and scalable masked diffusion framework that adapts flexibly to varying input and output modalities, positioning itself alongside efforts like RIV[27], which also explores scalable discrete representations. Compared to earlier unified approaches such as Peekaboo[3] or Multimodal Conditioned Diffusion[4], Lavida-O[0] emphasizes elasticity (dynamic masking patterns and resolution scaling) rather than fixed conditioning pipelines. This contrasts with domain-specific adaptations like Dmdiff[6] or Motion Masked Diffusion[7], which tailor architectures to particular data types. A central open question across these branches is how to balance the expressiveness of continuous diffusion with the computational efficiency and interpretability of discrete masked models, especially as systems scale to handle longer sequences and richer multimodal interactions.

Claimed Contributions

Lavida-O unified masked diffusion model

The authors introduce Lavida-O, a single framework that unifies image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis under a masked diffusion modeling objective, achieving state-of-the-art performance across multiple benchmarks.

10 retrieved papers; verdict: can refute.
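As background on the masked-diffusion objective this contribution builds on, the sketch below shows confidence-based iterative unmasking, the sampling pattern common to masked generative models: start from an all-mask sequence and commit the most confident predictions over a fixed number of steps. The `toy_model`, mask id, vocabulary size, and schedule are all hypothetical placeholders, not Lavida-O's actual components.

```python
import numpy as np

MASK, VOCAB = -1, 16          # hypothetical mask id and vocabulary size
rng = np.random.default_rng(0)

def toy_model(tokens):
    """Stand-in for the denoiser: per-position distributions over VOCAB.
    A real MDM would condition on the unmasked tokens and the text prompt."""
    logits = rng.normal(size=(len(tokens), VOCAB))
    return np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

def mdm_sample(seq_len=12, steps=4):
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        probs = toy_model(tokens)
        pred = probs.argmax(-1)            # greedy token guess per position
        conf = probs.max(-1)
        conf[tokens != MASK] = -np.inf     # keep already-committed tokens
        # unmask enough of the most confident positions to finish in `steps`
        masked = int((tokens == MASK).sum())
        k = int(np.ceil(masked / (steps - step)))
        for idx in np.argsort(conf)[::-1][:k]:
            tokens[idx] = pred[idx]
    return tokens
```

Because several positions are committed per step, the sampler needs far fewer forward passes than token-by-token autoregressive decoding, which is the usual source of the inference speedups claimed for MDMs.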
Elastic Mixture-of-Transformers architecture

The authors propose Elastic-MoT, an architecture that uses a smaller generation branch paired with a larger understanding branch and allows modality-specific attention in later layers, enabling efficient training and flexible parameter activation depending on the task.

10 retrieved papers; verdict: can refute.
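The parameter-activation idea behind Elastic-MoT can be sketched as simple task-based routing: understanding-only tasks touch only the large understanding branch, while generation and editing activate both branches so image tokens can attend to text features. The branch sizes and task names below are illustrative assumptions, not numbers from the paper.

```python
from dataclasses import dataclass

@dataclass
class Branch:
    name: str
    params_m: int  # parameter count in millions (illustrative numbers only)

# Hypothetical sizes: a large understanding branch, a light generation branch.
UND = Branch("understanding", params_m=8000)
GEN = Branch("generation", params_m=2000)

def active_branches(task):
    """Which branches (and hence which parameters) a task activates.
    Understanding-only tasks skip the generation branch entirely;
    generation/editing runs both branches jointly."""
    if task in ("vqa", "grounding"):
        return [UND]
    if task in ("t2i", "editing"):
        return [UND, GEN]
    raise ValueError(f"unknown task: {task}")

def active_params_m(task):
    return sum(b.params_m for b in active_branches(task))
```

The design choice this illustrates is that the cost of adding generation capability is bounded by the small branch, while pure understanding workloads pay nothing extra.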
Planning and self-reflection mechanisms

The authors introduce explicit mechanisms that leverage the model's understanding capabilities to improve generation: planning (generating layouts or identifying edit regions before synthesis) and reflection (evaluating and correcting generated outputs).

10 retrieved papers; verdict: can refute.
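The plan-generate-reflect loop described above has a simple control-flow skeleton: plan once, then alternate generation and self-evaluation, feeding the critique back into the next attempt until a quality threshold is met. The function names, threshold, and round budget below are hypothetical; the real model performs planning and reflection with its own understanding branch rather than external callbacks.

```python
def generate_with_reflection(prompt, plan_fn, gen_fn, judge_fn,
                             threshold=0.8, max_rounds=3):
    """Plan -> generate -> reflect loop (illustrative control flow only).
    plan_fn:  produces a layout / edit-region plan from the prompt
    gen_fn:   synthesizes an output from prompt + plan + prior feedback
    judge_fn: scores an output and returns (score, feedback)"""
    plan = plan_fn(prompt)
    feedback = None
    best, best_score = None, -1.0
    for _ in range(max_rounds):
        image = gen_fn(prompt, plan, feedback)
        score, feedback = judge_fn(prompt, image)
        if score > best_score:
            best, best_score = image, score
        if score >= threshold:
            break  # good enough; stop reflecting
    return best, best_score
```

With stub callables, one can see the loop stop early once the judge's score clears the threshold and return the best attempt seen so far.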

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Lavida-O unified masked diffusion model


Contribution

Elastic Mixture-of-Transformers architecture


Contribution

Planning and self-reflection mechanisms
