RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Mean Flow, Flow Matching, Noise Injection, Likelihood Maximization, Multimodal Generation
Abstract:

Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single function evaluation (1-NFE) generation often fails to yield compelling results. We address this issue by introducing RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a subsequent tailored noise-injection refinement step. RMFlow approximates the average velocity of the flow path using a neural network trained with a new loss function that balances minimizing the Wasserstein distance between probability paths against maximizing sample likelihood. RMFlow achieves competitive, often (near) state-of-the-art results on text-to-image, context-to-molecule, and time-series generation with 1-NFE, at a computational cost comparable to the MeanFlow baseline.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

RMFlow proposes a single-step multimodal generative model combining coarse mean flow transport with noise-injection refinement, targeting efficient generation across text-to-image, molecule synthesis, and time-series tasks. The paper resides in the Mean Flow Acceleration leaf, which contains only two papers total (RMFlow and one sibling). This represents a relatively sparse research direction within the broader taxonomy of 43 papers across 36 topics, suggesting the specific combination of mean flow prediction with refinement steps is not yet heavily explored in the literature examined.

The Mean Flow Acceleration leaf sits within the Distillation and Acceleration Techniques branch, which also includes Distribution Matching Distillation methods. Neighboring branches address trajectory optimization (Straighter Flow Matching, Motion Flow Matching) and discrete flow extensions, while application domains span audio synthesis, motion generation, and unified multimodal frameworks. RMFlow's approach diverges from explicit distillation losses used in distribution matching methods, instead learning average velocity fields directly. The taxonomy structure indicates acceleration research splits between mean flow prediction and teacher-student distillation paradigms, with RMFlow pursuing the former path.

Among the 28 candidates examined, the training objective contribution (balancing Wasserstein distance and likelihood maximization) has one refutable candidate among the 10 papers examined for it, indicating that some prior theoretical work exists in this space. The 1-NFE architecture contribution was examined against 8 candidates, none of which clearly refutes it, and the benchmark-results contribution against 10 candidates, with no refutations found. The limited search scope (28 papers, not hundreds) means these statistics reflect top semantic matches and citation neighbors rather than exhaustive coverage. Within this bounded search, the architecture and empirical results appear more distinctive than the theoretical training objective.

Based on the top-28 semantic matches examined, RMFlow occupies a sparsely populated research direction (2-paper leaf) addressing single-step multimodal generation. The architecture combining mean flow with refinement shows no clear prior overlap among candidates examined, though the training objective has at least one related predecessor. The analysis covers semantically proximate work and citation neighbors but does not claim exhaustive field coverage, leaving open whether additional relevant methods exist beyond this search scope.

Taxonomy

Core-task Taxonomy Papers: 43
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 1

Research Landscape Overview

Core task: efficient multimodal generation with single-step flow matching. The field has organized itself around several complementary research directions. Core Flow Matching Architectures and Training Objectives establish foundational methods for learning probability paths, including works like Motion Flow Matching[7] and Straighter Flow Matching[20] that refine trajectory design. Distillation and Acceleration Techniques focus on reducing computational costs through approaches such as mean flow acceleration and generator matching, exemplified by Distribution Matching Distillation[1] and Flow Generator Matching[14]. Application Domains span diverse modalities including audio generation (MMAudio[11], MusFlow[3]), motion synthesis (VersatileMotion[37]), and speech-driven tasks (TechSinger[12]). Unified Multimodal Frameworks like NExT-OMNI[19] and MammothModa[31] integrate multiple modalities within single architectures, while Quantization and Deployment Optimization addresses practical deployment through methods like Quantization Diffusion Models[22]. Evaluation and Analysis Methods provide tools for assessing generation quality and model behavior across these varied settings.

Within the acceleration landscape, a particularly active line explores mean flow techniques that enable few-step or single-step inference by learning to predict flow endpoints or intermediate statistics directly. RMFlow[0] sits squarely in this branch alongside MeanFlow Accelerated[29], both targeting rapid generation by bypassing iterative sampling. These methods contrast with distillation approaches like Distribution Matching Distillation[1] that transfer knowledge from multi-step teachers, and with architectural innovations like VAFlow[5] and GoalFlow[6] that modify the flow structure itself.
The central trade-off involves balancing generation quality against inference speed: while RMFlow[0] emphasizes single-step efficiency through mean flow prediction, neighboring works explore whether additional architectural constraints or alternative training objectives can preserve sample diversity and fidelity when collapsing multi-step processes into one-shot generation.

Claimed Contributions

RMFlow: 1-NFE multimodal generative model with noise-injection refinement

RMFlow is a new generative model that improves upon MeanFlow by combining a single-step (1-NFE) mean flow transport with a subsequent noise-injection refinement step. This design enables efficient, high-quality generation across multiple modalities including text-to-image, context-to-molecule, and time-series tasks.

8 retrieved papers
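The two-stage sampling procedure described in this contribution can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: `mean_velocity` is a toy stand-in for the learned average-velocity network, and the Gaussian refinement rule with scale `sigma` is an assumption, since this report does not specify the exact form of the noise-injection step.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_velocity(z, r, t):
    # Toy stand-in for the learned average-velocity network u(z, r, t);
    # a simple linear field here, just so the sketch runs end to end.
    return -z * (t - r)

def rmflow_sample(z1, sigma=0.1):
    """Single-step (1-NFE) mean-flow transport followed by a
    noise-injection refinement step (hypothetical Gaussian form)."""
    # Coarse transport from noise (t=1) to a sample estimate (t=0)
    # in one network evaluation: z_0 = z_1 - (t - r) * u(z_1, r=0, t=1).
    x_coarse = z1 - (1.0 - 0.0) * mean_velocity(z1, 0.0, 1.0)
    # Refinement: perturb the coarse sample with injected noise.
    return x_coarse + sigma * rng.standard_normal(x_coarse.shape)

z1 = rng.standard_normal((4, 8))  # a batch of latent noise vectors
x = rmflow_sample(z1)
print(x.shape)  # (4, 8)
```

The key point the sketch conveys is that only one network call occurs per sample; the refinement step adds noise but no further function evaluations.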
Theoretically principled training objective balancing Wasserstein distance and likelihood maximization

The authors introduce a novel loss function that jointly optimizes the MeanFlow objective (which controls Wasserstein distance) with a likelihood maximization term derived from the noise-injection step. This combined objective provides theoretical guarantees for both distributional alignment and sample quality.

10 retrieved papers
Can Refute
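A minimal sketch of what such a combined objective could look like, under strong assumptions: the MeanFlow term is taken as a squared error on the predicted average velocity, the noise-injection step is assumed Gaussian with scale `sigma` (so its negative log-likelihood reduces to a scaled squared error plus a constant), and the balancing weight `lam` is invented for illustration. None of these specifics come from the paper.

```python
import numpy as np

def meanflow_loss(u_pred, u_target):
    # MeanFlow term; per the report, this controls the Wasserstein
    # distance between probability paths.
    return np.mean((u_pred - u_target) ** 2)

def refinement_nll(x_refined, x_data, sigma=0.1):
    # Negative log-likelihood of the data under an assumed Gaussian
    # noise-injection step with scale sigma.
    d = x_data.size
    return (np.sum((x_data - x_refined) ** 2) / (2.0 * sigma**2)
            + d * np.log(sigma * np.sqrt(2.0 * np.pi)))

def rmflow_loss(u_pred, u_target, x_refined, x_data, lam=0.01):
    # Combined objective: distributional alignment (MeanFlow term)
    # plus a lam-weighted likelihood-maximization term.
    return meanflow_loss(u_pred, u_target) + lam * refinement_nll(x_refined, x_data)

u_pred, u_target = np.ones((2, 3)), np.zeros((2, 3))
x_refined = x_data = np.zeros((2, 3))
loss = rmflow_loss(u_pred, u_target, x_refined, x_data)
print(round(float(loss), 3))
```

The weight `lam` governs the trade-off the report describes: a larger value pushes individual samples toward high likelihood, while a smaller value prioritizes matching the probability path.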
Near state-of-the-art results on benchmark generation tasks using only 1-NFE

The authors demonstrate that RMFlow achieves competitive or state-of-the-art performance on multiple benchmark tasks (text-to-image on COCO, context-to-molecule on QM9, and time-series forecasting) while requiring only a single neural network evaluation, matching the computational cost of the MeanFlow baseline.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RMFlow: 1-NFE multimodal generative model with noise-injection refinement

RMFlow is a new generative model that improves upon MeanFlow by combining a single-step (1-NFE) mean flow transport with a subsequent noise-injection refinement step. This design enables efficient, high-quality generation across multiple modalities including text-to-image, context-to-molecule, and time-series tasks.

Contribution

Theoretically principled training objective balancing Wasserstein distance and likelihood maximization

The authors introduce a novel loss function that jointly optimizes the MeanFlow objective (which controls Wasserstein distance) with a likelihood maximization term derived from the noise-injection step. This combined objective provides theoretical guarantees for both distributional alignment and sample quality.

Contribution

Near state-of-the-art results on benchmark generation tasks using only 1-NFE

The authors demonstrate that RMFlow achieves competitive or state-of-the-art performance on multiple benchmark tasks (text-to-image on COCO, context-to-molecule on QM9, and time-series forecasting) while requiring only a single neural network evaluation, matching the computational cost of the MeanFlow baseline.