RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Mean Flow, Flow Matching, Noise Injection, Likelihood Maximization, Multimodal Generation
Abstract:

Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single function evaluation (1-NFE) generation often fails to yield compelling results. We address this issue by introducing RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a subsequent tailored noise-injection refinement step. RMFlow approximates the average velocity of the flow path using a neural network trained with a new loss function that balances minimizing the Wasserstein distance between probability paths against maximizing sample likelihood. RMFlow achieves competitive, often (near) state-of-the-art results on text-to-image, context-to-molecule, and time-series generation with 1-NFE, at a computational cost comparable to the MeanFlow baseline.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

RMFlow proposes a single-step multimodal generative model combining coarse mean flow transport with noise-injection refinement, targeting efficient generation across text-to-image, molecule synthesis, and time-series tasks. The paper resides in the Mean Flow Acceleration leaf, which contains only two papers total (RMFlow and one sibling). This represents a relatively sparse research direction within the broader taxonomy of 43 papers across 36 topics, suggesting the specific combination of mean flow prediction with refinement steps is not yet heavily explored in the literature examined.

The Mean Flow Acceleration leaf sits within the Distillation and Acceleration Techniques branch, which also includes Distribution Matching Distillation methods. Neighboring branches address trajectory optimization (Straighter Flow Matching, Motion Flow Matching) and discrete flow extensions, while application domains span audio synthesis, motion generation, and unified multimodal frameworks. RMFlow's approach diverges from explicit distillation losses used in distribution matching methods, instead learning average velocity fields directly. The taxonomy structure indicates acceleration research splits between mean flow prediction and teacher-student distillation paradigms, with RMFlow pursuing the former path.

Among the 28 candidates examined, the training objective contribution (balancing Wasserstein distance and likelihood maximization) has one refutable candidate among the 10 papers examined for it, indicating that some prior theoretical work exists in this space. The 1-NFE architecture contribution was examined against 8 candidates, none of which clearly refutes it, and the benchmark-results contribution against 10 candidates, with no refutations found. The limited search scope (28 papers, not hundreds) means these statistics reflect top semantic matches and citation neighbors rather than exhaustive coverage. Within this bounded search, the architecture and empirical results appear more distinctive than the theoretical training objective.

Based on the top-28 semantic matches examined, RMFlow occupies a sparsely populated research direction (2-paper leaf) addressing single-step multimodal generation. The architecture combining mean flow with refinement shows no clear prior overlap among candidates examined, though the training objective has at least one related predecessor. The analysis covers semantically proximate work and citation neighbors but does not claim exhaustive field coverage, leaving open whether additional relevant methods exist beyond this search scope.

Taxonomy

Core-task Taxonomy Papers: 43
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 1

Research Landscape Overview

Core task: efficient multimodal generation with single-step flow matching. The field has organized itself around several complementary research directions. Core Flow Matching Architectures and Training Objectives establish foundational methods for learning probability paths, including works like Motion Flow Matching[7] and Straighter Flow Matching[20] that refine trajectory design. Distillation and Acceleration Techniques focus on reducing computational costs through approaches such as mean flow acceleration and generator matching, exemplified by Distribution Matching Distillation[1] and Flow Generator Matching[14]. Application Domains span diverse modalities including audio generation (MMAudio[11], MusFlow[3]), motion synthesis (VersatileMotion[37]), and speech-driven tasks (TechSinger[12]). Unified Multimodal Frameworks like NExT-OMNI[19] and MammothModa[31] integrate multiple modalities within single architectures, while Quantization and Deployment Optimization addresses practical deployment through methods like Quantization Diffusion Models[22]. Evaluation and Analysis Methods provide tools for assessing generation quality and model behavior across these varied settings.

Within the acceleration landscape, a particularly active line explores mean flow techniques that enable few-step or single-step inference by learning to predict flow endpoints or intermediate statistics directly. RMFlow[0] sits squarely in this branch alongside MeanFlow Accelerated[29], both targeting rapid generation by bypassing iterative sampling. These methods contrast with distillation approaches like Distribution Matching Distillation[1] that transfer knowledge from multi-step teachers, and with architectural innovations like VAFlow[5] and GoalFlow[6] that modify the flow structure itself.
The central trade-off involves balancing generation quality against inference speed: while RMFlow[0] emphasizes single-step efficiency through mean flow prediction, neighboring works explore whether additional architectural constraints or alternative training objectives can preserve sample diversity and fidelity when collapsing multi-step processes into one-shot generation.

Claimed Contributions

RMFlow: 1-NFE multimodal generative model with noise-injection refinement

RMFlow is a new generative model that improves upon MeanFlow by combining a single-step (1-NFE) mean flow transport with a subsequent noise-injection refinement step. This design enables efficient, high-quality generation across multiple modalities including text-to-image, context-to-molecule, and time-series tasks.

8 retrieved papers
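The two-stage sampling procedure described in this contribution can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: `mean_velocity` is a toy stand-in for the learned average-velocity network, and the Gaussian refinement rule with scale `sigma` is an assumption, since this report does not specify the exact form of the noise-injection step.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_velocity(z, r, t):
    # Toy stand-in for the learned average-velocity network u(z, r, t);
    # a simple linear field here, just so the sketch runs end to end.
    return -z * (t - r)

def rmflow_sample(z1, sigma=0.1):
    """Single-step (1-NFE) mean-flow transport followed by a
    noise-injection refinement step (hypothetical Gaussian form)."""
    # Coarse transport from noise (t=1) to a sample estimate (t=0)
    # in one network evaluation: z_0 = z_1 - (t - r) * u(z_1, r=0, t=1).
    x_coarse = z1 - (1.0 - 0.0) * mean_velocity(z1, 0.0, 1.0)
    # Refinement: perturb the coarse sample with injected noise.
    return x_coarse + sigma * rng.standard_normal(x_coarse.shape)

z1 = rng.standard_normal((4, 8))  # a batch of latent noise vectors
x = rmflow_sample(z1)
print(x.shape)  # (4, 8)
```

The key point the sketch conveys is that only one network call occurs per sample; the refinement step adds noise but no further function evaluations.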
Theoretically principled training objective balancing Wasserstein distance and likelihood maximization

The authors introduce a novel loss function that jointly optimizes the MeanFlow objective (which controls Wasserstein distance) with a likelihood maximization term derived from the noise-injection step. This combined objective provides theoretical guarantees for both distributional alignment and sample quality.

10 retrieved papers
Can Refute
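A minimal sketch of what such a combined objective could look like, under strong assumptions: the MeanFlow term is taken as a squared error on the predicted average velocity, the noise-injection step is assumed Gaussian with scale `sigma` (so its negative log-likelihood reduces to a scaled squared error plus a constant), and the balancing weight `lam` is invented for illustration. None of these specifics come from the paper.

```python
import numpy as np

def meanflow_loss(u_pred, u_target):
    # MeanFlow term; per the report, this controls the Wasserstein
    # distance between probability paths.
    return np.mean((u_pred - u_target) ** 2)

def refinement_nll(x_refined, x_data, sigma=0.1):
    # Negative log-likelihood of the data under an assumed Gaussian
    # noise-injection step with scale sigma.
    d = x_data.size
    return (np.sum((x_data - x_refined) ** 2) / (2.0 * sigma**2)
            + d * np.log(sigma * np.sqrt(2.0 * np.pi)))

def rmflow_loss(u_pred, u_target, x_refined, x_data, lam=0.01):
    # Combined objective: distributional alignment (MeanFlow term)
    # plus a lam-weighted likelihood-maximization term.
    return meanflow_loss(u_pred, u_target) + lam * refinement_nll(x_refined, x_data)

u_pred, u_target = np.ones((2, 3)), np.zeros((2, 3))
x_refined = x_data = np.zeros((2, 3))
loss = rmflow_loss(u_pred, u_target, x_refined, x_data)
print(round(float(loss), 3))
```

The weight `lam` governs the trade-off the report describes: a larger value pushes individual samples toward high likelihood, while a smaller value prioritizes matching the probability path.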
Near state-of-the-art results on benchmark generation tasks using only 1-NFE

The authors demonstrate that RMFlow achieves competitive or state-of-the-art performance on multiple benchmark tasks (text-to-image on COCO, context-to-molecule on QM9, and time-series forecasting) while requiring only a single neural network evaluation, matching the computational cost of the MeanFlow baseline.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RMFlow: 1-NFE multimodal generative model with noise-injection refinement

RMFlow is a new generative model that improves upon MeanFlow by combining a single-step (1-NFE) mean flow transport with a subsequent noise-injection refinement step. This design enables efficient, high-quality generation across multiple modalities including text-to-image, context-to-molecule, and time-series tasks.

Contribution

Theoretically principled training objective balancing Wasserstein distance and likelihood maximization

The authors introduce a novel loss function that jointly optimizes the MeanFlow objective (which controls Wasserstein distance) with a likelihood maximization term derived from the noise-injection step. This combined objective provides theoretical guarantees for both distributional alignment and sample quality.

Contribution

Near state-of-the-art results on benchmark generation tasks using only 1-NFE

The authors demonstrate that RMFlow achieves competitive or state-of-the-art performance on multiple benchmark tasks (text-to-image on COCO, context-to-molecule on QM9, and time-series forecasting) while requiring only a single neural network evaluation, matching the computational cost of the MeanFlow baseline.