Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Referring Video Object Segmentation, Flow Matching
Abstract:

Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and to segment them continuously through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic `locate-then-segment' pipeline. However, this cascaded design creates an information bottleneck by reducing semantics to coarse geometric prompts (e.g., points), and it struggles to maintain temporal consistency because segmentation is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained text-to-video (T2V) models: fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of conventionally generating a mask from noise or directly predicting a mask, we reformulate the task as learning a direct, language-guided deformation from a video's holistic representation to its target mask. Our one-stage generative approach achieves new state-of-the-art results across all major RVOS benchmarks, including a J&F of 51.1 on MeViS (+1.6 over the prior SOTA) and 73.3 on zero-shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FlowRVS, which reformulates RVOS as a conditional continuous flow problem by learning language-guided deformations from holistic video representations to target masks. According to the taxonomy, this work resides in the 'Generative and Flow-Based Segmentation' leaf under 'Task Extensions and Variants'. Notably, this leaf contains only one paper (the original work itself), indicating this is a sparse and emerging research direction within the broader RVOS landscape of fifty papers across thirty-six topics.

The taxonomy reveals that most RVOS research concentrates on discriminative approaches: transformer-based architectures (three papers), foundation model adaptation (four papers), and explicit temporal propagation mechanisms (three papers). The generative paradigm sits apart from these mainstream directions, which typically employ cascaded 'locate-then-segment' pipelines or direct mask prediction. Neighboring leaves address reasoning-driven segmentation (four papers) and motion expression guidance (two papers), but these maintain discriminative frameworks rather than reformulating the task as continuous generation or deformation.

Among twenty-eight candidates examined across three contributions, none were identified as clearly refuting the proposed approach. For the core reformulation as continuous flow (eight candidates examined), the principled T2V transfer techniques (ten candidates), and the FlowRVS framework (ten candidates), all examined papers were classified as non-refutable or unclear. This suggests that within the limited search scope of top-K semantic matches, the specific combination of flow-based deformation and T2V model adaptation for RVOS appears relatively unexplored, though the search does not claim exhaustive coverage of all generative video segmentation literature.

Based on the limited literature search of twenty-eight candidates, the work appears to occupy a novel position by bridging text-to-video generative models with referring segmentation. However, the analysis acknowledges its scope constraints: it examines semantic neighbors rather than comprehensive generative modeling or video understanding literature. The sparse population of the generative segmentation leaf and absence of refuting candidates among examined papers suggest potential novelty, though broader searches in diffusion models or video generation domains might reveal additional relevant context.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: Referring video object segmentation guided by natural language aims to segment target objects in video sequences based on textual descriptions. The field's taxonomy reveals five main branches that capture distinct research emphases. Core Architecture and Multimodal Fusion Approaches focus on how vision and language modalities are integrated, often through transformer-based designs like Multimodal Transformers[2] or cross-modal attention mechanisms such as Asymmetric Cross-guided Attention[45]. Temporal Modeling and Consistency Mechanisms address the challenge of maintaining coherent segmentations across frames, employing memory networks and temporal reasoning strategies exemplified by works like Hybrid Memory[35] and Language-Bridged Spatial-Temporal[25]. Specialized Learning Paradigms and Training Strategies explore alternative supervision signals and optimization techniques, including low-supervision settings[21] and reinforcement learning approaches[14]. Task Extensions and Variants broaden the scope to related problems such as generative segmentation, flow-based methods, and cross-domain applications like RefMask3D[30] for 3D scenarios. Finally, Datasets, Benchmarks, and Survey Literature provide foundational resources, with benchmarks like MeViS Benchmark[4] and MOSE Dataset[9], alongside comprehensive surveys[7][11][36].

Recent work has increasingly explored generative and flow-based paradigms as alternatives to traditional discriminative segmentation pipelines. Deforming Videos Masks[0] situates itself within this emerging direction under Task Extensions and Variants, leveraging generative modeling to produce segmentation masks through deformation processes. This contrasts with more conventional architectures that rely on direct mask prediction from fused multimodal features, as seen in Language-Guided Contextual Transformer[26] or Vision-Language Pretrained[16] models.
While many studies emphasize temporal consistency through explicit memory modules or recurrent structures, generative approaches like Deforming Videos Masks[0] and related diffusion-based methods such as Text-to-Video Diffusion[42] offer a different perspective by modeling mask generation as a continuous transformation. This line of work raises open questions about the trade-offs between generative flexibility and computational efficiency, and how such methods can effectively incorporate temporal coherence without relying on traditional tracking or propagation mechanisms.

Claimed Contributions

Reformulation of RVOS as text-conditioned continuous flow

The authors reconceptualize Referring Video Object Segmentation as a conditional continuous flow problem governed by an ODE, where a velocity field learns to deform video representations into target masks under text guidance. This replaces the traditional cascaded locate-then-segment paradigm with a unified end-to-end generative approach.
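The reformulation follows the standard flow-matching recipe: pick a time t along a path from the video representation (t = 0) to the target mask (t = 1), and regress a text-conditioned velocity field onto the path's displacement. A minimal sketch, assuming a rectified-flow (straight-line) path over latent arrays; `velocity_model` and all other names are hypothetical stand-ins, not identifiers from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(video_latent, mask_latent, text_emb, velocity_model):
    """One illustrative training step of a video-to-mask flow objective.

    The intermediate state linearly interpolates between the video latent
    (t = 0) and the mask latent (t = 1); along this straight path the target
    velocity d x_t / d t is the constant difference mask - video.
    """
    t = rng.uniform()                                   # flow time in [0, 1]
    x_t = (1.0 - t) * video_latent + t * mask_latent    # point on the path
    target_v = mask_latent - video_latent               # constant velocity
    pred_v = velocity_model(x_t, t, text_emb)           # text-conditioned field
    return float(np.mean((pred_v - target_v) ** 2))
```

At inference the mask would be recovered by integrating the learned ODE forward from the video latent (e.g. with a few Euler steps), which is what makes the process a deformation of the video rather than a denoising of random noise.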

8 retrieved papers

Principled techniques for transferring T2V models to video understanding

The authors introduce three synergistic adaptations—boundary-biased sampling, start-point augmentation, and direct video injection—specifically designed to address the asymmetric nature of the convergent video-to-mask flow and stabilize the learning of the initial trajectory where text-guided velocity is most critical.
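Only the stated purpose of these adaptations is described here, so the following is a sketch under explicit assumptions: boundary-biased sampling is modeled as a Beta draw that oversamples small t (the start of the trajectory, where the text-guided velocity is said to be most critical), and start-point augmentation as Gaussian jitter of the flow's starting latent. Direct video injection is omitted as architecture-specific; every distribution and parameter below is an assumption, not the paper's actual choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_flow_time(alpha=0.5, beta=1.5):
    """Boundary-biased time sampling (assumed form).

    A Beta(alpha < 1, beta > 1) draw concentrates probability mass near
    t = 0, so training visits the early, text-critical part of the
    video-to-mask trajectory more often than uniform sampling would.
    """
    return rng.beta(alpha, beta)

def augment_start_point(video_latent, noise_std=0.1):
    """Start-point augmentation (assumed form): jitter the trajectory's
    starting latent so the flow does not overfit one exact start state."""
    noise = rng.standard_normal(np.shape(video_latent))
    return video_latent + noise_std * noise
```

Both pieces address the asymmetry the contribution describes: unlike noise-to-data generation, the start of a convergent video-to-mask flow is a structured input, so the early trajectory needs both more supervision and more variation.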

10 retrieved papers
FlowRVS framework achieving state-of-the-art RVOS performance

The authors present FlowRVS, a one-stage generative framework that achieves new state-of-the-art results across major RVOS benchmarks by leveraging the reformulated flow paradigm and proposed adaptations, demonstrating superior handling of complex language and dynamic video scenarios.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Reformulation of RVOS as text-conditioned continuous flow

The authors reconceptualize Referring Video Object Segmentation as a conditional continuous flow problem governed by an ODE, where a velocity field learns to deform video representations into target masks under text guidance. This replaces the traditional cascaded locate-then-segment paradigm with a unified end-to-end generative approach.

Contribution 2: Principled techniques for transferring T2V models to video understanding

The authors introduce three synergistic adaptations—boundary-biased sampling, start-point augmentation, and direct video injection—specifically designed to address the asymmetric nature of the convergent video-to-mask flow and stabilize the learning of the initial trajectory where text-guided velocity is most critical.

Contribution 3: FlowRVS framework achieving state-of-the-art RVOS performance

The authors present FlowRVS, a one-stage generative framework that achieves new state-of-the-art results across major RVOS benchmarks by leveraging the reformulated flow paradigm and proposed adaptations, demonstrating superior handling of complex language and dynamic video scenarios.