NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
Overview
Overall Novelty Assessment
NExT-OMNI proposes a unified omnimodal foundation model using discrete flow matching to enable any-to-any cross-modal understanding, generation, and retrieval. The paper resides in the 'Discrete Flow and Latent Space Alignment Approaches' leaf, which contains only two papers: the work under review and a single sibling (OmniBridge). This is a relatively sparse research direction within the broader taxonomy of 47 papers, suggesting the discrete flow paradigm for omnimodal modeling remains an emerging area with limited prior exploration.
The taxonomy reveals that neighboring leaves pursue alternative unification strategies: 'Language-Centric Grounding to Visual and Audio Modalities' anchors multimodal capabilities to frozen text LLMs, while 'Large-Scale Omnimodal Pretraining' emphasizes emergent cross-modal abilities through scale. The parent branch 'Unified Omnimodal Foundation Models' explicitly excludes retrieval-augmented and task-specific architectures, positioning NExT-OMNI's unified design in contrast to the 'Multimodal Retrieval-Augmented Generation' branch (14 papers) and domain-specific applications. This structural context highlights the paper's focus on parametric unification rather than external knowledge integration.
Across the 30 candidates examined (10 per contribution), the core discrete flow contribution has one refutable candidate, the reconstruction-enhanced representation has none, and the dynamic generation strategy faces the strongest prior work with two refutable candidates. These statistics reflect a limited search scoped to semantic similarity, not exhaustive coverage. The varying refutation rates suggest the architectural innovations around representation learning may be more distinctive than the generation optimization techniques.
Based on the top-30 semantic matches, NExT-OMNI occupies a sparsely populated research direction with limited direct precedents in discrete flow-based omnimodal modeling. However, the analysis cannot assess novelty against the broader landscape of flow-based generative models or alternative unification paradigms outside the examined candidates. The taxonomy structure indicates this work diverges from dominant retrieval-augmented and language-centric approaches, though the full extent of differentiation remains constrained by search scope.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose NExT-OMNI, the first open-source omnimodal model built entirely on discrete flow matching (DFM) techniques. This model supports any-to-any generation and understanding across text, images, video, and audio within a unified architecture, offering faster inference compared to autoregressive approaches.
The authors design a unified representation approach that incorporates reconstruction losses from modality encoders during training. This design enables deep multimodal feature fusion, supporting both precise cross-modal retrieval and multi-turn any-to-any multimodal interactions without requiring task-decoupled architectures.
The authors introduce a dynamic length generation strategy that adjusts response lengths in block-size increments based on end-of-sequence confidence, combined with an adaptive cache mechanism. This approach improves text generation capabilities and achieves 1.2× faster inference than autoregressive architectures.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment
Contribution Analysis
Detailed comparisons for each claimed contribution
NExT-OMNI omnimodal foundation model based on discrete flow matching
The authors propose NExT-OMNI, the first open-source omnimodal model built entirely on discrete flow matching (DFM) techniques. This model supports any-to-any generation and understanding across text, images, video, and audio within a unified architecture, offering faster inference compared to autoregressive approaches.
[61] Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities
[58] Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design
[59] VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching
[60] Flow matching with general discrete paths: A kinetic-optimal perspective
[62] Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
[63] RiboGen: RNA Sequence and Structure Co-Generation with Equivariant MultiFlow
[64] Decentralized Autoregressive Generation
[65] Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
[66] JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation
[67] MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
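The discrete flow matching paradigm claimed above can be illustrated with a toy sampler. This is a minimal sketch, not the authors' implementation: the mask token, the linear unmasking schedule, and the `toy_model` stub are all assumptions introduced here. The key idea it demonstrates is that, unlike autoregressive decoding, a DFM-style sampler starts from a fully masked sequence and reveals many positions in parallel over a small number of flow steps.

```python
import random

MASK = -1  # hypothetical mask token standing in for the DFM source distribution

def dfm_sample(model, seq_len, vocab_size, num_steps=8):
    """Toy discrete flow matching sampler: start from an all-mask sequence
    and progressively unmask positions on a linear schedule, filling each
    revealed position with the model's predicted token."""
    x = [MASK] * seq_len
    for step in range(1, num_steps + 1):
        t = step / num_steps                # flow time in (0, 1]
        target_revealed = int(t * seq_len)  # linear unmasking schedule
        masked = [i for i, tok in enumerate(x) if tok == MASK]
        to_reveal = max(0, target_revealed - (seq_len - len(masked)))
        for i in random.sample(masked, min(to_reveal, len(masked))):
            probs = model(x, i)             # per-position categorical dist.
            x[i] = max(range(vocab_size), key=lambda k: probs[k])
    return x

# A stand-in "model" that always prefers token 3, just to exercise the loop.
def toy_model(x, i):
    return [0.1, 0.1, 0.1, 0.7]

print(dfm_sample(toy_model, seq_len=6, vocab_size=4))  # → [3, 3, 3, 3, 3, 3]
```

Because whole blocks of positions are filled per step, the number of model calls scales with `num_steps` rather than sequence length, which is the source of the inference-speed advantage claimed over autoregressive decoding.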
Reconstruction-enhanced unified representation with intermediate feature fusion
The authors design a unified representation approach that incorporates reconstruction losses from modality encoders during training. This design enables deep multimodal feature fusion, supporting both precise cross-modal retrieval and multi-turn any-to-any multimodal interactions without requiring task-decoupled architectures.
[48] SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation
[49] Learning Factorized Multimodal Representations
[50] Contrast with reconstruct: Contrastive 3d representation learning guided by generative pretraining
[51] Commonality Feature Representation Learning for Unsupervised Multimodal Change Detection
[52] M3AE: Multimodal Representation Learning for Brain Tumor Segmentation with Missing Modalities
[53] Unified multi-modal image synthesis for missing modality imputation
[54] CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers
[55] Pre-gating and contextual attention gate - A new fusion method for multi-modal data tasks
[56] Specificity-guided cross-modal feature reconstruction for RGB-infrared object detection
[57] Enhancing multimodal unified representations for cross modal generalization
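The reconstruction-enhanced objective described above can be sketched as a weighted sum of a generation loss and a feature-reconstruction term. This is a hedged illustration only: the function names, the MSE choice, and the weight `lam` are assumptions, not details taken from the paper, which does not specify its loss form at this level.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(gen_loss, encoder_feats, decoded_feats, lam=0.5):
    """Hypothetical training objective: the flow-matching generation loss
    plus a weighted reconstruction term that asks the unified representation
    to reproduce the modality encoder's features."""
    return gen_loss + lam * mse(encoder_feats, decoded_feats)

# Matching features add no penalty; mismatched features do.
print(combined_loss(2.0, [1.0, 2.0], [1.0, 2.0]))  # → 2.0
print(combined_loss(2.0, [1.0, 2.0], [0.0, 0.0]))  # → 3.25
```

The intuition is that the reconstruction term keeps the shared representation faithful to each modality encoder's features, which is what would let the same representation serve both retrieval (feature matching) and generation without task-decoupled heads.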
Dynamic generation strategy and adaptive caching for improved performance
The authors introduce a dynamic length generation strategy that adjusts response lengths in block-size increments based on end-of-sequence confidence, combined with an adaptive cache mechanism. This approach improves text generation capabilities and achieves 1.2× faster inference than autoregressive architectures.
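The block-incremental loop described above can be sketched as follows. Everything here is an assumption for illustration (the `toy_decoder`, the confidence schedule, the threshold value); the paper's actual decoding and caching logic is not specified at this granularity. The sketch shows only the control flow: grow the response one block at a time and stop once end-of-sequence confidence crosses a threshold, rather than committing to a fixed response length up front.

```python
def generate_dynamic(model, block_size=4, max_blocks=8, eos_conf=0.9):
    """Hypothetical dynamic-length loop: decode one block per iteration and
    stop once the model's end-of-sequence confidence exceeds a threshold."""
    tokens = []
    for _ in range(max_blocks):
        block, p_eos = model(tokens, block_size)  # decode a block + EOS prob
        tokens.extend(block)
        if p_eos >= eos_conf:
            break
    return tokens

# Stand-in decoder whose EOS confidence grows with generated length.
def toy_decoder(ctx, n):
    block = [7] * n
    p_eos = min(1.0, (len(ctx) + n) / 12)
    return block, p_eos

print(len(generate_dynamic(toy_decoder)))  # → 12
```

An adaptive cache, as claimed, would additionally reuse intermediate states across these block iterations so that each new block avoids recomputing attention over the prefix; that reuse is what the 1.2× speedup figure would depend on.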