FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Generative models, Flow matching, Any-to-any generation
Abstract:

Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches suffer from inefficiency: they require large-scale datasets, often with restrictive pairing constraints; incur high computational cost from modeling the joint distribution; and rely on multi-stage training pipelines. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent space to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6× fewer parameters and training 10× faster than prior methods.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

FlowBind proposes a unified latent space framework for any-to-any generation across text, image, and audio modalities, using modality-specific invertible flows that bridge a shared latent representation. The paper resides in the 'Unified Latent Space Approaches' leaf, which contains only three papers total, including FlowBind itself and two siblings (OmniFlow and Next-omni). This indicates a relatively sparse research direction within the broader any-to-any multi-modal generation landscape, suggesting the approach occupies a less crowded niche compared to specialized pairwise methods like text-image or video-audio translation.

The taxonomy tree reveals that neighboring branches focus on multi-modal rectified flow transformers, text-image joint flow matching, and video-audio synthesis with temporal alignment. FlowBind diverges from these by emphasizing a factorized latent space design rather than direct cross-modal evolution or transformer-based architectures. The 'Any-to-Any Multi-Modal Generation Frameworks' parent branch excludes models limited to specific modality pairs, positioning FlowBind's arbitrary-subset training capability as a distinguishing feature. Nearby specialized applications (visual-tactile mapping, medical imaging) operate in domain-specific contexts, whereas FlowBind targets general-purpose media modalities.

Among twenty-five candidates examined, the contribution-level analysis shows mixed novelty signals. The core FlowBind framework (shared latent plus invertible flows) examined ten candidates with zero refutations, suggesting limited direct overlap in this architectural choice. However, the single-stage joint optimization contribution examined ten candidates and found one refutable match, indicating prior work has explored unified flow-matching objectives. The gradient stopping strategy examined five candidates with no refutations, implying this stabilization technique may be less commonly documented in the limited search scope. These statistics reflect top-K semantic matches, not exhaustive coverage.

Based on the limited search scope of twenty-five candidates, FlowBind appears to introduce a relatively novel architectural factorization for any-to-any generation, though the single-stage optimization approach has precedent. The sparse population of the 'Unified Latent Space Approaches' leaf (three papers) and the absence of refutations for the core framework suggest meaningful differentiation from examined prior work, while acknowledging that broader literature may contain additional relevant methods not captured in this top-K analysis.

Taxonomy

Core-task Taxonomy Papers: 21
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 1

Research Landscape Overview

Core task: cross-modal generation between arbitrary modalities using flow-based models. The field has evolved from specialized pairwise translation methods—such as text-to-image or video-to-audio pipelines—toward more unified frameworks that handle multiple modalities within a single architecture. The taxonomy reflects this progression through several main branches: Any-to-Any Multi-Modal Generation Frameworks pursue general-purpose systems capable of translating among diverse inputs and outputs (e.g., OmniFlow[3], Next-omni[7]); Text-Image and Video-Audio Cross-Modal Generation branches capture well-established domain-specific methods; Joint Image-Video Generation with Flow Transformers explores temporal consistency across visual media; Specialized Cross-Modal Flow Applications address niche pairings like tactile-visual or environment-aware channels; Flow-Based Cross-Modal Alignment and Fusion focuses on learning shared representations; and Flow-Based Image-to-Image Translation deals with style transfer and domain adaptation. Together, these branches illustrate a shift from task-specific models toward architectures that unify latent spaces and leverage flow matching or continuous normalizing flows to bridge modality gaps.

Recent work has concentrated on scaling unified latent space approaches and improving training efficiency across modalities. A key tension lies between designing fully general any-to-any systems—which promise flexibility but may sacrifice per-task performance—and refining specialized pipelines that excel in narrow settings (e.g., Foley-Flow[4] for video-to-audio, VAFlow[6] for similar audio-visual tasks). FlowBind[0] sits within the Unified Latent Space Approaches cluster alongside OmniFlow[3] and Next-omni[7], emphasizing a shared embedding space where flow-based transformations enable bidirectional translation among arbitrary modalities.
Compared to OmniFlow[3], which also targets any-to-any generation, FlowBind[0] may differ in architectural choices or the granularity of modality-specific conditioning, while Next-omni[7] explores similar unification goals with potentially distinct flow parameterizations. Open questions remain around how to balance modality-specific inductive biases with the desire for a single, scalable framework, and whether flow-based methods can match or surpass diffusion-based alternatives in quality and computational cost.

Claimed Contributions

FlowBind framework with learnable shared latent and per-modality invertible flows

The authors propose FlowBind, a framework that learns a shared latent space capturing cross-modal information and connects each modality to this latent through modality-specific invertible flows. This factorization enables training with arbitrary paired data while reducing computational cost compared to joint modeling approaches.

10 retrieved papers
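To make the claimed factorization concrete, the following is a minimal numpy sketch of the training objective it implies: an auxiliary encoder produces a shared latent z from whatever subset of modalities is available, and each modality has its own drift (velocity) network trained with a standard flow-matching target along a linear interpolant. The names (`encoder`, `velocity`, `W_enc`, `W_drift`) and the toy linear networks are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_mod, d_lat, n = 4, 3, 8  # toy modality dim, latent dim, batch size

# Any subset of modalities can be present; here, two are available.
mods = {"image": rng.normal(size=(n, d_mod)),
        "audio": rng.normal(size=(n, d_mod))}
W_enc = {k: rng.normal(size=(d_mod, d_lat)) for k in mods}
W_drift = {k: rng.normal(size=(d_mod + 1 + d_lat, d_mod)) for k in mods}

def encoder(mods):
    # Toy shared-latent encoder: mean of per-modality linear projections.
    return np.mean([m @ W_enc[k] for k, m in mods.items()], axis=0)

def velocity(name, x_t, t, z):
    # Toy modality-specific drift network, linear in (x_t, t, z).
    inp = np.concatenate([x_t, np.full((x_t.shape[0], 1), t), z], axis=1)
    return inp @ W_drift[name]

z = encoder(mods)                   # shared latent from the available subset
t = 0.3                             # a single sampled flow time, for brevity
loss = 0.0
for name, x1 in mods.items():       # one flow-matching term per modality
    x0 = rng.normal(size=x1.shape)  # noise endpoint of the probability path
    x_t = (1 - t) * x0 + t * x1     # linear interpolant between noise and data
    target = x1 - x0                # flow-matching velocity target
    pred = velocity(name, x_t, t, z)
    loss += np.mean((pred - target) ** 2)
loss /= len(mods)
```

Because every cross-modal interaction is routed through z, the sum only ranges over modalities actually present in the batch, which is what allows training on arbitrary paired subsets.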
Single-stage joint optimization under unified flow-matching objective

The framework trains both the auxiliary encoder (producing the shared latent) and all modality-specific drift networks together using a single flow-matching objective, eliminating the complex multi-stage training procedures required by prior methods like CoDi and OmniFlow.

10 retrieved papers
Can Refute
Gradient stopping strategy for stable encoder learning within flow-matching

The authors introduce a training strategy that stops gradients through the encoder for t>0 while updating it at t=0, which prevents collapse and ensures the encoder learns to minimize conditional variance. This approach achieves stable training without requiring additional contrastive losses or regularizers used in prior direct flow methods.

5 retrieved papers
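The routing logic of this strategy can be sketched as follows. In numpy there is no autograd, so `stop_grad` below is a stand-in for a framework primitive such as `torch.Tensor.detach()` or `jax.lax.stop_gradient`; the function and variable names are hypothetical, and only the control flow (train the encoder at t = 0, freeze its output for t > 0) reflects the described technique.

```python
import numpy as np

rng = np.random.default_rng(1)

def stop_grad(z):
    # Stand-in for detach/stop_gradient: the value passes through unchanged,
    # but in a real autodiff framework no gradient would reach the encoder.
    return z.copy()

def routed_latent(z, t):
    # Gradients reach the encoder only through t == 0 samples;
    # for t > 0 the drift networks see z as a constant condition.
    return z if t == 0.0 else stop_grad(z)

z = rng.normal(size=(4, 3))  # shared latent produced by the encoder
latents = {t: routed_latent(z, t) for t in (0.0, 0.25, 0.9)}
```

Conditioning the drift networks on a frozen latent at t > 0 is what (per the claim) prevents the encoder from collapsing to a trivial solution, without auxiliary contrastive losses.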

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
