NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Omnimodal · Multimodal Learning · Discrete Flow Matching
Abstract:

Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to handle understanding and generation separately within unified frameworks, their redundant, non-integrated designs limit their applicability to broader scenarios such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal understanding and generation benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we will release training details, data protocols, and open-source both the code and model checkpoints.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

NExT-OMNI proposes a unified omnimodal foundation model using discrete flow matching to enable any-to-any cross-modal understanding, generation, and retrieval. The paper resides in the 'Discrete Flow and Latent Space Alignment Approaches' leaf, which contains only two papers: the original work itself and one sibling (OmniBridge). This is a relatively sparse research direction within the broader taxonomy of 47 papers, suggesting the discrete flow paradigm for omnimodal modeling remains an emerging area with limited prior exploration.

The taxonomy reveals that neighboring leaves pursue alternative unification strategies: 'Language-Centric Grounding to Visual and Audio Modalities' anchors multimodal capabilities to frozen text LLMs, while 'Large-Scale Omnimodal Pretraining' emphasizes emergent cross-modal abilities through scale. The parent branch 'Unified Omnimodal Foundation Models' explicitly excludes retrieval-augmented and task-specific architectures, positioning NExT-OMNI's unified design in contrast to the 'Multimodal Retrieval-Augmented Generation' branch (14 papers) and domain-specific applications. This structural context highlights the paper's focus on parametric unification rather than external knowledge integration.

Among the 30 candidates examined (10 per contribution), the core discrete flow contribution has one refutable candidate, the reconstruction-enhanced representation appears more novel with zero refutations, and the dynamic generation strategy faces the strongest prior work with two refutable candidates. These statistics reflect a limited, semantic-similarity-driven search, not exhaustive coverage. The varying refutation rates suggest that the architectural innovations around representation learning may be more distinctive than the generation optimization techniques.

Based on the top-30 semantic matches, NExT-OMNI occupies a sparsely populated research direction with limited direct precedents in discrete flow-based omnimodal modeling. However, the analysis cannot assess novelty against the broader landscape of flow-based generative models or alternative unification paradigms outside the examined candidates. The taxonomy structure indicates this work diverges from dominant retrieval-augmented and language-centric approaches, though the full extent of differentiation remains constrained by search scope.

Taxonomy

Core-task Taxonomy Papers: 47
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: any-to-any omnimodal understanding, generation, and retrieval. The field encompasses systems that can flexibly process, generate, and retrieve information across diverse modalities—text, images, audio, video, and beyond. The taxonomy reveals several major branches: Unified Omnimodal Foundation Models aim to build single architectures capable of handling arbitrary modality combinations, often through shared latent spaces or discrete flow techniques. Multimodal Retrieval-Augmented Generation (RAG) focuses on enhancing generation quality by retrieving relevant cross-modal evidence, as seen in works like Multi-RAG[5] and End-to-End Multimodal RAG[7]. Universal Multimodal Retrieval addresses the challenge of searching across heterogeneous data types, while Cross-Modal Alignment and Semantic Understanding explores how to bridge modality gaps through joint embeddings and shared representations. Domain-Specific Multimodal Applications adapt these techniques to specialized contexts, and Multimodal Data and Benchmark Development provides the datasets and evaluation frameworks that drive progress across all branches.

Within the Unified Omnimodal Foundation Models branch, a particularly active line of work centers on discrete flow and latent space alignment approaches, which seek to unify modalities by mapping them into common representational spaces. NExT-OMNI[0] exemplifies this direction, emphasizing seamless any-to-any transformations through latent alignment mechanisms. Its close neighbor OmniBridge[1] similarly tackles omnimodal integration but may differ in architectural choices or training strategies for achieving cross-modal coherence. These efforts contrast with retrieval-focused methods like Murag[2] or generation-centric approaches such as Generating Images Multimodal[6], which prioritize specific downstream tasks over universal modality handling.

A key open question across these branches is how to balance the flexibility of truly omnimodal systems with the efficiency and specialization gains of task-specific architectures, while ensuring robust alignment when modalities exhibit vastly different statistical properties.

Claimed Contributions

NExT-OMNI omnimodal foundation model based on discrete flow matching

The authors propose NExT-OMNI, the first open-source omnimodal model built entirely on discrete flow matching (DFM) techniques. This model supports any-to-any generation and understanding across text, images, video, and audio within a unified architecture, offering faster inference compared to autoregressive approaches.
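To make the discrete flow matching idea concrete, the sketch below shows a common iterative-unmasking view of DFM sampling: generation starts from a fully masked sequence and progressively commits the most confident positions over a fixed number of steps. The `toy_denoiser` function is a hypothetical stand-in for the learned model; NExT-OMNI's actual probability path (metric-induced, with kinetic optimal velocities) and architecture are not reproduced here.

```python
import numpy as np

VOCAB = 16          # toy vocabulary size
MASK = VOCAB        # reserved mask token id
SEQ_LEN = 8
STEPS = 4

rng = np.random.default_rng(0)

def toy_denoiser(tokens):
    """Stand-in for the learned model: logits over VOCAB per position."""
    return rng.normal(size=(len(tokens), VOCAB))

def dfm_sample(steps=STEPS, seq_len=SEQ_LEN):
    tokens = np.full(seq_len, MASK)  # start fully masked (source distribution)
    for step in range(steps):
        logits = toy_denoiser(tokens)
        # Softmax to per-position token distributions.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        conf = probs.max(axis=-1)    # model confidence per position
        pred = probs.argmax(axis=-1)
        # Unmask the most confident masked positions on a linear schedule.
        n_keep_masked = int(seq_len * (1 - (step + 1) / steps))
        masked = np.where(tokens == MASK)[0]
        order = masked[np.argsort(-conf[masked])]
        to_unmask = order[: max(len(masked) - n_keep_masked, 0)]
        tokens[to_unmask] = pred[to_unmask]
    return tokens
```

Because every step commits several positions in parallel, this style of sampler needs far fewer forward passes than token-by-token autoregressive decoding, which is the intuition behind the claimed response-efficiency advantage.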

10 retrieved papers
Can Refute
Reconstruction-enhanced unified representation with intermediate feature fusion

The authors design a unified representation approach that incorporates reconstruction losses from modality encoders during training. This design enables deep multimodal feature fusion, supporting both precise cross-modal retrieval and multi-turn any-to-any multimodal interactions without requiring task-decoupled architectures.
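A minimal sketch of what a reconstruction-augmented objective could look like: the main understanding/generation loss is combined with weighted reconstruction terms from each modality encoder. The function names, the MSE choice, and the weighting scheme are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between an encoder input and its reconstruction."""
    return float(np.mean((a - b) ** 2))

def unified_loss(task_loss, recon_pairs, recon_weight=0.1):
    """task_loss: scalar main objective.
    recon_pairs: {modality: (encoder_input, reconstruction)}."""
    recon = sum(mse(x, x_hat) for x, x_hat in recon_pairs.values())
    return task_loss + recon_weight * recon

# Toy example: one imperfect and one perfect reconstruction.
x_img = np.ones((4, 4))
x_aud = np.zeros(8)
loss = unified_loss(
    task_loss=1.0,
    recon_pairs={
        "image": (x_img, x_img * 0.9),  # imperfect reconstruction
        "audio": (x_aud, x_aud),        # perfect reconstruction
    },
)
```

The design intuition is that the reconstruction term forces the shared representation to retain modality-specific detail, which is what makes the same features usable for both retrieval and generation without task-decoupled branches.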

10 retrieved papers
Dynamic generation strategy and adaptive caching for improved performance

The authors introduce a dynamic length generation strategy that adjusts response lengths in block-size increments based on end-of-sequence confidence, combined with a vanilla adaptive cache mechanism. This approach improves text generation capabilities and achieves 1.2× faster inference compared to autoregressive architectures.
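The block-wise stopping rule described above can be sketched as follows: decode one block at a time and stop once the model's end-of-sequence (EOS) confidence clears a threshold. `toy_block_decoder` is a hypothetical stand-in whose EOS confidence rises as the sequence grows; the block size and threshold values are illustrative, not the paper's settings.

```python
import numpy as np

BLOCK = 32           # tokens generated per block
MAX_BLOCKS = 8       # hard cap on response length
EOS_THRESHOLD = 0.9  # stop once EOS confidence reaches this

rng = np.random.default_rng(1)

def toy_block_decoder(prefix_len):
    """Stand-in for the model: returns one decoded block plus an EOS
    confidence that grows with sequence length, mimicking a response
    that is wrapping up."""
    block = rng.integers(0, 100, size=BLOCK)
    eos_conf = min(1.0, 0.3 + 0.2 * (prefix_len // BLOCK))
    return block, eos_conf

def generate_dynamic():
    tokens = []
    for _ in range(MAX_BLOCKS):
        block, eos_conf = toy_block_decoder(len(tokens))
        tokens.extend(block.tolist())
        if eos_conf >= EOS_THRESHOLD:  # confident the response is complete
            break
    return tokens
```

Compared with always decoding a fixed maximum length, extending the response only while EOS confidence stays low avoids wasted decoding steps, which is the mechanism behind the reported speedup over autoregressive baselines.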

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

NExT-OMNI omnimodal foundation model based on discrete flow matching

The authors propose NExT-OMNI, the first open-source omnimodal model built entirely on discrete flow matching (DFM) techniques. This model supports any-to-any generation and understanding across text, images, video, and audio within a unified architecture, offering faster inference compared to autoregressive approaches.

Contribution

Reconstruction-enhanced unified representation with intermediate feature fusion

The authors design a unified representation approach that incorporates reconstruction losses from modality encoders during training. This design enables deep multimodal feature fusion, supporting both precise cross-modal retrieval and multi-turn any-to-any multimodal interactions without requiring task-decoupled architectures.

Contribution

Dynamic generation strategy and adaptive caching for improved performance

The authors introduce a dynamic length generation strategy that adjusts response lengths in block-size increments based on end-of-sequence confidence, combined with a vanilla adaptive cache mechanism. This approach improves text generation capabilities and achieves 1.2× faster inference compared to autoregressive architectures.