Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Referring Video Object Segmentation, Flow Matching
Abstract:

Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and to segment them continuously through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic `locate-then-segment' pipeline. However, this cascaded design creates an information bottleneck by reducing semantics to coarse geometric prompts (e.g., points), and it struggles to maintain temporal consistency because segmentation is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained text-to-video (T2V) models: fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of conventionally generating a mask from noise or directly predicting a mask, we reformulate the task as learning a direct, language-guided deformation from a video's holistic representation to its target mask. Our one-stage generative approach achieves new state-of-the-art results across all major RVOS benchmarks, including a J&F of 51.1 on MeViS (+1.6 over the prior SOTA) and 73.3 on zero-shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FlowRVS, which reformulates RVOS as a conditional continuous flow problem by learning language-guided deformations from holistic video representations to target masks. According to the taxonomy, this work resides in the 'Generative and Flow-Based Segmentation' leaf under 'Task Extensions and Variants'. Notably, this leaf contains only one paper (the original work itself), indicating this is a sparse and emerging research direction within the broader RVOS landscape of fifty papers across thirty-six topics.

The taxonomy reveals that most RVOS research concentrates on discriminative approaches: transformer-based architectures (three papers), foundation model adaptation (four papers), and explicit temporal propagation mechanisms (three papers). The generative paradigm sits apart from these mainstream directions, which typically employ cascaded 'locate-then-segment' pipelines or direct mask prediction. Neighboring leaves address reasoning-driven segmentation (four papers) and motion expression guidance (two papers), but these maintain discriminative frameworks rather than reformulating the task as continuous generation or deformation.

Among twenty-eight candidates examined across three contributions, none were identified as clearly refuting the proposed approach. For the core reformulation as continuous flow (eight candidates examined), the principled T2V transfer techniques (ten candidates), and the FlowRVS framework (ten candidates), all examined papers were classified as non-refutable or unclear. This suggests that within the limited search scope of top-K semantic matches, the specific combination of flow-based deformation and T2V model adaptation for RVOS appears relatively unexplored, though the search does not claim exhaustive coverage of all generative video segmentation literature.

Based on the limited literature search of twenty-eight candidates, the work appears to occupy a novel position by bridging text-to-video generative models with referring segmentation. However, the analysis acknowledges its scope constraints: it examines semantic neighbors rather than comprehensive generative modeling or video understanding literature. The sparse population of the generative segmentation leaf and absence of refuting candidates among examined papers suggest potential novelty, though broader searches in diffusion models or video generation domains might reveal additional relevant context.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: Referring video object segmentation guided by natural language aims to segment target objects in video sequences based on textual descriptions. The field's taxonomy reveals five main branches that capture distinct research emphases. Core Architecture and Multimodal Fusion Approaches focus on how vision and language modalities are integrated, often through transformer-based designs like Multimodal Transformers[2] or cross-modal attention mechanisms such as Asymmetric Cross-guided Attention[45]. Temporal Modeling and Consistency Mechanisms address the challenge of maintaining coherent segmentations across frames, employing memory networks and temporal reasoning strategies exemplified by works like Hybrid Memory[35] and Language-Bridged Spatial-Temporal[25]. Specialized Learning Paradigms and Training Strategies explore alternative supervision signals and optimization techniques, including low-supervision settings[21] and reinforcement learning approaches[14]. Task Extensions and Variants broaden the scope to related problems such as generative segmentation, flow-based methods, and cross-domain applications like RefMask3D[30] for 3D scenarios. Finally, Datasets, Benchmarks, and Survey Literature provide foundational resources, with benchmarks like MeViS Benchmark[4] and MOSE Dataset[9], alongside comprehensive surveys[7][11][36].

Recent work has increasingly explored generative and flow-based paradigms as alternatives to traditional discriminative segmentation pipelines. Deforming Videos Masks[0] situates itself within this emerging direction under Task Extensions and Variants, leveraging generative modeling to produce segmentation masks through deformation processes. This contrasts with more conventional architectures that rely on direct mask prediction from fused multimodal features, as seen in Language-Guided Contextual Transformer[26] or Vision-Language Pretrained[16] models.
While many studies emphasize temporal consistency through explicit memory modules or recurrent structures, generative approaches like Deforming Videos Masks[0] and related diffusion-based methods such as Text-to-Video Diffusion[42] offer a different perspective by modeling mask generation as a continuous transformation. This line of work raises open questions about the trade-offs between generative flexibility and computational efficiency, and how such methods can effectively incorporate temporal coherence without relying on traditional tracking or propagation mechanisms.

Claimed Contributions

Reformulation of RVOS as text-conditioned continuous flow

The authors reconceptualize Referring Video Object Segmentation as a conditional continuous flow problem governed by an ODE, where a velocity field learns to deform video representations into target masks under text guidance. This replaces the traditional cascaded locate-then-segment paradigm with a unified end-to-end generative approach.
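The reformulation follows the standard flow-matching recipe: pick a time t along a path from the video representation (t = 0) to the target mask (t = 1), and regress a text-conditioned velocity field onto the path's displacement. A minimal sketch, assuming a rectified-flow (straight-line) path over latent arrays; `velocity_model` and all other names are hypothetical stand-ins, not identifiers from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(video_latent, mask_latent, text_emb, velocity_model):
    """One illustrative training step of a video-to-mask flow objective.

    The intermediate state linearly interpolates between the video latent
    (t = 0) and the mask latent (t = 1); along this straight path the target
    velocity d x_t / d t is the constant difference mask - video.
    """
    t = rng.uniform()                                   # flow time in [0, 1]
    x_t = (1.0 - t) * video_latent + t * mask_latent    # point on the path
    target_v = mask_latent - video_latent               # constant velocity
    pred_v = velocity_model(x_t, t, text_emb)           # text-conditioned field
    return float(np.mean((pred_v - target_v) ** 2))
```

At inference the mask would be recovered by integrating the learned ODE forward from the video latent (e.g. with a few Euler steps), which is what makes the process a deformation of the video rather than a denoising of random noise.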

8 retrieved papers

Principled techniques for transferring T2V models to video understanding

The authors introduce three synergistic adaptations—boundary-biased sampling, start-point augmentation, and direct video injection—specifically designed to address the asymmetric nature of the convergent video-to-mask flow and stabilize the learning of the initial trajectory where text-guided velocity is most critical.
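Only the stated purpose of these adaptations is described here, so the following is a sketch under explicit assumptions: boundary-biased sampling is modeled as a Beta draw that oversamples small t (the start of the trajectory, where the text-guided velocity is said to be most critical), and start-point augmentation as Gaussian jitter of the flow's starting latent. Direct video injection is omitted as architecture-specific; every distribution and parameter below is an assumption, not the paper's actual choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_flow_time(alpha=0.5, beta=1.5):
    """Boundary-biased time sampling (assumed form).

    A Beta(alpha < 1, beta > 1) draw concentrates probability mass near
    t = 0, so training visits the early, text-critical part of the
    video-to-mask trajectory more often than uniform sampling would.
    """
    return rng.beta(alpha, beta)

def augment_start_point(video_latent, noise_std=0.1):
    """Start-point augmentation (assumed form): jitter the trajectory's
    starting latent so the flow does not overfit one exact start state."""
    noise = rng.standard_normal(np.shape(video_latent))
    return video_latent + noise_std * noise
```

Both pieces address the asymmetry the contribution describes: unlike noise-to-data generation, the start of a convergent video-to-mask flow is a structured input, so the early trajectory needs both more supervision and more variation.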

10 retrieved papers
FlowRVS framework achieving state-of-the-art RVOS performance

The authors present FlowRVS, a one-stage generative framework that achieves new state-of-the-art results across major RVOS benchmarks by leveraging the reformulated flow paradigm and proposed adaptations, demonstrating superior handling of complex language and dynamic video scenarios.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Reformulation of RVOS as text-conditioned continuous flow

The authors reconceptualize Referring Video Object Segmentation as a conditional continuous flow problem governed by an ODE, where a velocity field learns to deform video representations into target masks under text guidance. This replaces the traditional cascaded locate-then-segment paradigm with a unified end-to-end generative approach.

Contribution 2: Principled techniques for transferring T2V models to video understanding

The authors introduce three synergistic adaptations—boundary-biased sampling, start-point augmentation, and direct video injection—specifically designed to address the asymmetric nature of the convergent video-to-mask flow and stabilize the learning of the initial trajectory where text-guided velocity is most critical.

Contribution 3: FlowRVS framework achieving state-of-the-art RVOS performance

The authors present FlowRVS, a one-stage generative framework that achieves new state-of-the-art results across major RVOS benchmarks by leveraging the reformulated flow paradigm and proposed adaptations, demonstrating superior handling of complex language and dynamic video scenarios.