FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.0 Download Report PDF

Generative modelsFlow matchingany-to-any generation

Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches are challenged by its inefficiency, as they require large-scale datasets often with restrictive pairing constraints, incur high computation cost from modeling joint distribution, and multi-stage training pipeline. We propose \textbf{FlowBind}, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6× fewer parameters and training 10× faster than prior methods.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

FlowBind proposes a unified latent space framework for any-to-any generation across text, image, and audio modalities, using modality-specific invertible flows that bridge a shared latent representation. The paper resides in the 'Unified Latent Space Approaches' leaf, which contains only three papers total, including FlowBind itself and two siblings (OmniFlow and Next-omni). This indicates a relatively sparse research direction within the broader any-to-any multi-modal generation landscape, suggesting the approach occupies a less crowded niche compared to specialized pairwise methods like text-image or video-audio translation.

The taxonomy tree reveals that neighboring branches focus on multi-modal rectified flow transformers, text-image joint flow matching, and video-audio synthesis with temporal alignment. FlowBind diverges from these by emphasizing a factorized latent space design rather than direct cross-modal evolution or transformer-based architectures. The 'Any-to-Any Multi-Modal Generation Frameworks' parent branch excludes models limited to specific modality pairs, positioning FlowBind's arbitrary-subset training capability as a distinguishing feature. Nearby specialized applications (visual-tactile mapping, medical imaging) operate in domain-specific contexts, whereas FlowBind targets general-purpose media modalities.

Among twenty-five candidates examined, the contribution-level analysis shows mixed novelty signals. The core FlowBind framework (shared latent plus invertible flows) examined ten candidates with zero refutations, suggesting limited direct overlap in this architectural choice. However, the single-stage joint optimization contribution examined ten candidates and found one refutable match, indicating prior work has explored unified flow-matching objectives. The gradient stopping strategy examined five candidates with no refutations, implying this stabilization technique may be less commonly documented in the limited search scope. These statistics reflect top-K semantic matches, not exhaustive coverage.

Based on the limited search scope of twenty-five candidates, FlowBind appears to introduce a relatively novel architectural factorization for any-to-any generation, though the single-stage optimization approach has precedent. The sparse population of the 'Unified Latent Space Approaches' leaf (three papers) and the absence of refutations for the core framework suggest meaningful differentiation from examined prior work, while acknowledging that broader literature may contain additional relevant methods not captured in this top-K analysis.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: cross-modal generation between arbitrary modalities using flow-based models. The field has evolved from specialized pairwise translation methods—such as text-to-image or video-to-audio pipelines—toward more unified frameworks that handle multiple modalities within a single architecture. The taxonomy reflects this progression through several main branches: Any-to-Any Multi-Modal Generation Frameworks pursue general-purpose systems capable of translating among diverse inputs and outputs (e.g., OmniFlow[3], Next-omni[7]); Text-Image and Video-Audio Cross-Modal Generation branches capture well-established domain-specific methods; Joint Image-Video Generation with Flow Transformers explores temporal consistency across visual media; Specialized Cross-Modal Flow Applications address niche pairings like tactile-visual or environment-aware channels; Flow-Based Cross-Modal Alignment and Fusion focuses on learning shared representations; and Flow-Based Image-to-Image Translation deals with style transfer and domain adaptation. Together, these branches illustrate a shift from task-specific models toward architectures that unify latent spaces and leverage flow matching or continuous normalizing flows to bridge modality gaps. Recent work has concentrated on scaling unified latent space approaches and improving training efficiency across modalities. A key tension lies between designing fully general any-to-any systems—which promise flexibility but may sacrifice per-task performance—and refining specialized pipelines that excel in narrow settings (e.g., Foley-Flow[4] for video-to-audio, VAFlow[6] for similar audio-visual tasks). FlowBind[0] sits within the Unified Latent Space Approaches cluster alongside OmniFlow[3] and Next-omni[7], emphasizing a shared embedding space where flow-based transformations enable bidirectional translation among arbitrary modalities. Compared to OmniFlow[3], which also targets any-to-any generation, FlowBind[0] may differ in architectural choices or the granularity of modality-specific conditioning, while Next-omni[7] explores similar unification goals with potentially distinct flow parameterizations. Open questions remain around how to balance modality-specific inductive biases with the desire for a single, scalable framework, and whether flow-based methods can match or surpass diffusion-based alternatives in quality and computational cost.

Claimed Contributions

FlowBind framework with learnable shared latent and per-modality invertible flows

10 retrieved papers

The authors propose FlowBind, a framework that learns a shared latent space capturing cross-modal information and connects each modality to this latent through modality-specific invertible flows. This factorization enables training with arbitrary paired data while reducing computational cost compared to joint modeling approaches.

10 retrieved papers

Single-stage joint optimization under unified flow-matching objective

Can Refute

10 retrieved papers

The framework trains both the auxiliary encoder (producing the shared latent) and all modality-specific drift networks together using a single flow-matching objective, eliminating the complex multi-stage training procedures required by prior methods like CoDi and OmniFlow.

10 retrieved papers

Can Refute

Gradient stopping strategy for stable encoder learning within flow-matching

5 retrieved papers

The authors introduce a training strategy that stops gradients through the encoder for t>0 while updating it at t=0, which prevents collapse and ensures the encoder learns to minimize conditional variance. This approach achieves stable training without requiring additional contrastive losses or regularizers used in prior direct flow methods.

5 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[3] OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows PDF

Li Shufan, Konstantinos Kallidromitis, Shufan Li, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, Aditya Grover (2025)

[7] Next-omni: Towards any-to-any omnimodal foundation models with discrete flow matching PDF

Luo Run, Xia, Xiaobo, Run Luo, Wang Lu, Xiaobo Xia, Lu Wang, Longze Chen, LUO Jing, Renke Shan, Yang Min, Jing Luo, Chua, Tat-Seng, Min Yang, Tat-Seng Chua (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FlowBind framework with learnable shared latent and per-modality invertible flows

[5] Bidirectional visual-tactile cross-modal generation using latent feature space flow model PDF

Cannot Refute

[24] MusFlow: Multimodal Music Generation via Conditional Flow Matching PDF

Cannot Refute

[30] Stabilizing invertible neural networks using mixture models PDF

Cannot Refute

[31] A dual-stream feature decomposition network with weight transformation for multi-modality image fusion PDF

Cannot Refute

[32] CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion PDF

Cannot Refute

[33] Unsupervised multi-modal medical image registration via invertible translation PDF

Cannot Refute

[34] Large Generative Models for Different Data Types PDF

Cannot Refute

[35] Farmer: Flow autoregressive transformer over pixels PDF

Cannot Refute

[36] MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion PDF

Cannot Refute

[37] Flow-based spatio-temporal structured prediction of motion dynamics PDF

Cannot Refute

Contribution

Single-stage joint optimization under unified flow-matching objective

[7] Next-omni: Towards any-to-any omnimodal foundation models with discrete flow matching PDF

Can Refute

[6] VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching PDF

Cannot Refute

[22] Full-Atom Peptide Design based on Multi-modal Flow Matching PDF

Cannot Refute

[23] Surface-based Molecular Design with Multi-modal Flow Matching PDF

Cannot Refute

[24] MusFlow: Multimodal Music Generation via Conditional Flow Matching PDF

Cannot Refute

[25] Vfp: Variational flow-matching policy for multi-modal robot manipulation PDF

Cannot Refute

[26] Unified speech and gesture synthesis using flow matching PDF

Cannot Refute

[27] Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow PDF

Cannot Refute

[28] Molform: Multi-modal flow matching for structure-based drug design PDF

Cannot Refute

[29] Flow Matching Imitation Learning for Multi-Support Manipulation PDF

Cannot Refute

Contribution

Gradient stopping strategy for stable encoder learning within flow-matching

[38] Fine-tuning Flow Matching Generative Models with Intermediate Feedback PDF

Cannot Refute

[39] Optimal flow matching: Learning straight trajectories in just one step PDF

Cannot Refute

[40] Trajectory flow matching with applications to clinical time series modelling PDF

Cannot Refute

[41] One-step Flow Matching Generators PDF

Cannot Refute

[42] End-to-End Single-Step Flow Matching via Direct Models PDF

Cannot Refute

FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

[3] OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows PDF

[7] Next-omni: Towards any-to-any omnimodal foundation models with discrete flow matching PDF

Contribution Analysis

FlowBind framework with learnable shared latent and per-modality invertible flows

[5] Bidirectional visual-tactile cross-modal generation using latent feature space flow model PDF

[24] MusFlow: Multimodal Music Generation via Conditional Flow Matching PDF

[30] Stabilizing invertible neural networks using mixture models PDF

[31] A dual-stream feature decomposition network with weight transformation for multi-modality image fusion PDF

[32] CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion PDF

[33] Unsupervised multi-modal medical image registration via invertible translation PDF

[34] Large Generative Models for Different Data Types PDF

[35] Farmer: Flow autoregressive transformer over pixels PDF

[36] MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion PDF

[37] Flow-based spatio-temporal structured prediction of motion dynamics PDF

Single-stage joint optimization under unified flow-matching objective

[7] Next-omni: Towards any-to-any omnimodal foundation models with discrete flow matching PDF

[6] VAFlow: Video-to-Audio Generation with Cross-Modality Flow Matching PDF

[22] Full-Atom Peptide Design based on Multi-modal Flow Matching PDF

[23] Surface-based Molecular Design with Multi-modal Flow Matching PDF

[24] MusFlow: Multimodal Music Generation via Conditional Flow Matching PDF

[25] Vfp: Variational flow-matching policy for multi-modal robot manipulation PDF

[26] Unified speech and gesture synthesis using flow matching PDF

[27] Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow PDF

[28] Molform: Multi-modal flow matching for structure-based drug design PDF

[29] Flow Matching Imitation Learning for Multi-Support Manipulation PDF

Gradient stopping strategy for stable encoder learning within flow-matching

[38] Fine-tuning Flow Matching Generative Models with Intermediate Feedback PDF

[39] Optimal flow matching: Learning straight trajectories in just one step PDF

[40] Trajectory flow matching with applications to clinical time series modelling PDF

[41] One-step Flow Matching Generators PDF

[42] End-to-End Single-Step Flow Matching via Direct Models PDF

Table of Contents