NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Omnimodal · Multimodal Learning · Discrete Flow Matching
Abstract:

Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to handle understanding and generation separately within unified frameworks, their redundant, non-integrated designs limit their applicability to broader scenarios such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal understanding and generation benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we will release training details, data protocols, and open-source both the code and model checkpoints.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

NExT-OMNI proposes a unified omnimodal foundation model using discrete flow matching to enable any-to-any cross-modal understanding, generation, and retrieval. The paper resides in the 'Discrete Flow and Latent Space Alignment Approaches' leaf, which contains only two papers: the original work itself and one sibling (OmniBridge). This is a relatively sparse research direction within the broader taxonomy of 47 papers, suggesting the discrete flow paradigm for omnimodal modeling remains an emerging area with limited prior exploration.

The taxonomy reveals that neighboring leaves pursue alternative unification strategies: 'Language-Centric Grounding to Visual and Audio Modalities' anchors multimodal capabilities to frozen text LLMs, while 'Large-Scale Omnimodal Pretraining' emphasizes emergent cross-modal abilities through scale. The parent branch 'Unified Omnimodal Foundation Models' explicitly excludes retrieval-augmented and task-specific architectures, positioning NExT-OMNI's unified design in contrast to the 'Multimodal Retrieval-Augmented Generation' branch (14 papers) and domain-specific applications. This structural context highlights the paper's focus on parametric unification rather than external knowledge integration.

Among the 30 candidates examined (10 per contribution), the core discrete flow contribution has one refutable candidate, the reconstruction-enhanced representation appears more novel with zero refutations, and the dynamic generation strategy faces the strongest prior work with two refutable candidates. These statistics reflect a limited, semantic-similarity-driven search, not exhaustive coverage. The varying refutation rates suggest that the architectural innovations around representation learning may be more distinctive than the generation optimization techniques.

Based on the top-30 semantic matches, NExT-OMNI occupies a sparsely populated research direction with limited direct precedents in discrete flow-based omnimodal modeling. However, the analysis cannot assess novelty against the broader landscape of flow-based generative models or alternative unification paradigms outside the examined candidates. The taxonomy structure indicates this work diverges from dominant retrieval-augmented and language-centric approaches, though the full extent of differentiation remains constrained by search scope.

Taxonomy

Core-task Taxonomy Papers: 47
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: any-to-any omnimodal understanding, generation, and retrieval. The field encompasses systems that can flexibly process, generate, and retrieve information across diverse modalities—text, images, audio, video, and beyond. The taxonomy reveals several major branches: Unified Omnimodal Foundation Models aim to build single architectures capable of handling arbitrary modality combinations, often through shared latent spaces or discrete flow techniques. Multimodal Retrieval-Augmented Generation (RAG) focuses on enhancing generation quality by retrieving relevant cross-modal evidence, as seen in works like Multi-RAG[5] and End-to-End Multimodal RAG[7]. Universal Multimodal Retrieval addresses the challenge of searching across heterogeneous data types, while Cross-Modal Alignment and Semantic Understanding explores how to bridge modality gaps through joint embeddings and shared representations. Domain-Specific Multimodal Applications adapt these techniques to specialized contexts, and Multimodal Data and Benchmark Development provides the datasets and evaluation frameworks that drive progress across all branches.

Within the Unified Omnimodal Foundation Models branch, a particularly active line of work centers on discrete flow and latent space alignment approaches, which seek to unify modalities by mapping them into common representational spaces. NExT-OMNI[0] exemplifies this direction, emphasizing seamless any-to-any transformations through latent alignment mechanisms. Its close neighbor OmniBridge[1] similarly tackles omnimodal integration but may differ in architectural choices or training strategies for achieving cross-modal coherence. These efforts contrast with retrieval-focused methods like Murag[2] or generation-centric approaches such as Generating Images Multimodal[6], which prioritize specific downstream tasks over universal modality handling.

A key open question across these branches is how to balance the flexibility of truly omnimodal systems with the efficiency and specialization gains of task-specific architectures, while ensuring robust alignment when modalities exhibit vastly different statistical properties.

Claimed Contributions

NExT-OMNI omnimodal foundation model based on discrete flow matching

The authors propose NExT-OMNI, the first open-source omnimodal model built entirely on discrete flow matching (DFM) techniques. This model supports any-to-any generation and understanding across text, images, video, and audio within a unified architecture, offering faster inference compared to autoregressive approaches.
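To make the discrete flow matching idea concrete, the sketch below shows a common iterative-unmasking view of DFM sampling: generation starts from a fully masked sequence and progressively commits the most confident positions over a fixed number of steps. The `toy_denoiser` function is a hypothetical stand-in for the learned model; NExT-OMNI's actual probability path (metric-induced, with kinetic optimal velocities) and architecture are not reproduced here.

```python
import numpy as np

VOCAB = 16          # toy vocabulary size
MASK = VOCAB        # reserved mask token id
SEQ_LEN = 8
STEPS = 4

rng = np.random.default_rng(0)

def toy_denoiser(tokens):
    """Stand-in for the learned model: logits over VOCAB per position."""
    return rng.normal(size=(len(tokens), VOCAB))

def dfm_sample(steps=STEPS, seq_len=SEQ_LEN):
    tokens = np.full(seq_len, MASK)  # start fully masked (source distribution)
    for step in range(steps):
        logits = toy_denoiser(tokens)
        # Softmax to per-position token distributions.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        conf = probs.max(axis=-1)    # model confidence per position
        pred = probs.argmax(axis=-1)
        # Unmask the most confident masked positions on a linear schedule.
        n_keep_masked = int(seq_len * (1 - (step + 1) / steps))
        masked = np.where(tokens == MASK)[0]
        order = masked[np.argsort(-conf[masked])]
        to_unmask = order[: max(len(masked) - n_keep_masked, 0)]
        tokens[to_unmask] = pred[to_unmask]
    return tokens
```

Because every step commits several positions in parallel, this style of sampler needs far fewer forward passes than token-by-token autoregressive decoding, which is the intuition behind the claimed response-efficiency advantage.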

10 retrieved papers
Can Refute
Reconstruction-enhanced unified representation with intermediate feature fusion

The authors design a unified representation approach that incorporates reconstruction losses from modality encoders during training. This design enables deep multimodal feature fusion, supporting both precise cross-modal retrieval and multi-turn any-to-any multimodal interactions without requiring task-decoupled architectures.
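A minimal sketch of what a reconstruction-augmented objective could look like: the main understanding/generation loss is combined with weighted reconstruction terms from each modality encoder. The function names, the MSE choice, and the weighting scheme are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between an encoder input and its reconstruction."""
    return float(np.mean((a - b) ** 2))

def unified_loss(task_loss, recon_pairs, recon_weight=0.1):
    """task_loss: scalar main objective.
    recon_pairs: {modality: (encoder_input, reconstruction)}."""
    recon = sum(mse(x, x_hat) for x, x_hat in recon_pairs.values())
    return task_loss + recon_weight * recon

# Toy example: one imperfect and one perfect reconstruction.
x_img = np.ones((4, 4))
x_aud = np.zeros(8)
loss = unified_loss(
    task_loss=1.0,
    recon_pairs={
        "image": (x_img, x_img * 0.9),  # imperfect reconstruction
        "audio": (x_aud, x_aud),        # perfect reconstruction
    },
)
```

The design intuition is that the reconstruction term forces the shared representation to retain modality-specific detail, which is what makes the same features usable for both retrieval and generation without task-decoupled branches.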

10 retrieved papers
Dynamic generation strategy and adaptive caching for improved performance

The authors introduce a dynamic length generation strategy that adjusts response lengths in block-size increments based on end-of-sequence confidence, combined with a vanilla adaptive cache mechanism. This approach improves text generation capabilities and achieves 1.2× faster inference compared to autoregressive architectures.
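The block-wise stopping rule described above can be sketched as follows: decode one block at a time and stop once the model's end-of-sequence (EOS) confidence clears a threshold. `toy_block_decoder` is a hypothetical stand-in whose EOS confidence rises as the sequence grows; the block size and threshold values are illustrative, not the paper's settings.

```python
import numpy as np

BLOCK = 32           # tokens generated per block
MAX_BLOCKS = 8       # hard cap on response length
EOS_THRESHOLD = 0.9  # stop once EOS confidence reaches this

rng = np.random.default_rng(1)

def toy_block_decoder(prefix_len):
    """Stand-in for the model: returns one decoded block plus an EOS
    confidence that grows with sequence length, mimicking a response
    that is wrapping up."""
    block = rng.integers(0, 100, size=BLOCK)
    eos_conf = min(1.0, 0.3 + 0.2 * (prefix_len // BLOCK))
    return block, eos_conf

def generate_dynamic():
    tokens = []
    for _ in range(MAX_BLOCKS):
        block, eos_conf = toy_block_decoder(len(tokens))
        tokens.extend(block.tolist())
        if eos_conf >= EOS_THRESHOLD:  # confident the response is complete
            break
    return tokens
```

Compared with always decoding a fixed maximum length, extending the response only while EOS confidence stays low avoids wasted decoding steps, which is the mechanism behind the reported speedup over autoregressive baselines.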

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

NExT-OMNI omnimodal foundation model based on discrete flow matching

The authors propose NExT-OMNI, the first open-source omnimodal model built entirely on discrete flow matching (DFM) techniques. This model supports any-to-any generation and understanding across text, images, video, and audio within a unified architecture, offering faster inference compared to autoregressive approaches.

Contribution

Reconstruction-enhanced unified representation with intermediate feature fusion

The authors design a unified representation approach that incorporates reconstruction losses from modality encoders during training. This design enables deep multimodal feature fusion, supporting both precise cross-modal retrieval and multi-turn any-to-any multimodal interactions without requiring task-decoupled architectures.

Contribution

Dynamic generation strategy and adaptive caching for improved performance

The authors introduce a dynamic length generation strategy that adjusts response lengths in block-size increments based on end-of-sequence confidence, combined with a vanilla adaptive cache mechanism. This approach improves text generation capabilities and achieves 1.2× faster inference compared to autoregressive architectures.