FlowNar: Scalable Streaming Narration for Long-Form Videos

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: streaming video narration, vision language models, long-form video understanding, cross linear attentive memory
Abstract:

Recent Large Multimodal Models (LMMs), primarily designed for offline settings, are ill-suited for the dynamic requirements of streaming video. While recent online adaptations improve real-time processing, they still face critical scalability challenges, with resource demands typically growing at least linearly with video duration. To overcome this bottleneck, we propose FlowNar, a novel framework for scalable streaming video narration. The core of FlowNar is a dynamic context management strategy that removes historical visual context, combined with our novel Cross Linear Attentive Memory (CLAM) module for retaining streaming visual history, ensuring bounded visual memory usage and computational complexity, both crucial for efficient streaming. We also introduce a realistic autoregressive evaluation protocol and complementary evaluation metrics to assess streaming narration models under deployment-like conditions. Experiments on the Ego4D, EgoExo4D, and EpicKitchens100 datasets demonstrate that FlowNar substantially improves narration quality over strong baselines while being highly efficient, supporting 10× longer videos and achieving 3× higher throughput (FPS).

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), so the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, and human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

FlowNar proposes a memory-efficient streaming framework for long-form video narration, introducing dynamic context management and the CLAM module to maintain bounded visual memory usage. The paper resides in the Memory-Efficient Streaming Frameworks leaf, which contains only three papers total, including FlowNar itself. This represents a relatively sparse research direction within the broader taxonomy of streaming video narration, suggesting the specific focus on bounded memory and scalable streaming architectures is not yet densely populated. The sibling papers Flash-VStream and Flash-VStream Efficient share similar goals of real-time processing with constrained resources.

The taxonomy reveals that FlowNar's leaf sits within the Streaming and Real-Time Video Processing Architectures branch, which also includes Online Dense Captioning with Temporal Localization (four papers) and Interactive Real-Time Video Understanding (one paper). Neighboring branches address Long-Form Video Understanding through hierarchical methods and Computational Efficiency via token reduction or state-space models. FlowNar diverges from hierarchical approaches by emphasizing single-pass incremental processing rather than multi-level aggregation, and from pure efficiency techniques by integrating memory management directly into the streaming architecture rather than applying post-hoc optimizations.

Of the thirty candidate papers examined (ten per contribution), only the CLAM module shows a refutable candidate (one of its ten), indicating some overlap with prior memory mechanisms in streaming contexts. The FlowNar framework itself and the autoregressive evaluation protocol each had zero refutations among their ten candidates, suggesting these contributions address gaps less directly covered by the limited search scope. The framework's emphasis on dynamic context removal appears more distinctive than the memory module design, though the modest search scale means substantial related work may exist beyond the top thirty semantic matches retrieved.

Given the sparse population of the Memory-Efficient Streaming Frameworks leaf and the limited refutation rate across contributions, FlowNar appears to occupy a relatively underexplored niche within streaming video narration. However, the analysis covers only thirty candidates from semantic search, leaving open the possibility that relevant work exists in adjacent areas such as efficient video encoders or temporal modeling techniques not captured by the search strategy.

Taxonomy

- 24 Core-task Taxonomy Papers
- 3 Claimed Contributions
- 30 Contribution Candidate Papers Compared
- 1 Refutable Paper

Research Landscape Overview

Core task: streaming video narration for long-form videos. The field addresses the challenge of generating natural-language descriptions for extended video content in real time or near-real time, requiring systems that can process continuous streams efficiently while maintaining coherent narrative structure. The taxonomy organizes research into six main branches:

- Streaming and Real-Time Video Processing Architectures: frameworks that handle incoming frames with minimal latency and bounded memory.
- Long-Form Video Understanding and Hierarchical Captioning: summarizing or segmenting hours of footage into meaningful narrative units.
- Video-Language Representation Learning and Pretraining: foundational models that align visual and textual modalities.
- Computational Efficiency and Scalability Techniques: methods that reduce inference cost and enable deployment at scale.
- Cross-Modal Alignment and Grounding: ensuring that generated text accurately reflects temporal events and spatial regions in the video.
- Application-Specific Video Captioning Systems: solutions tailored to domains such as instructional content, live events, or accessibility services.

Representative works like Flash-VStream[10] and HourVideo[11] illustrate how memory-efficient streaming frameworks balance throughput with narrative quality, while methods such as Streaming Dense Captioning[3] and MM-Narrator[16] demonstrate hierarchical approaches to long-form understanding. Several active lines of work reveal key trade-offs between latency, memory footprint, and caption richness. Memory-efficient streaming frameworks prioritize bounded state and incremental processing, enabling real-time operation on resource-constrained devices, whereas hierarchical captioning methods often require multiple passes or segment-level aggregation to produce coherent long-form narratives.
FlowNar[0] sits within the Memory-Efficient Streaming Frameworks cluster, emphasizing low-latency narration with constrained memory usage, closely aligned with Flash-VStream[10] and Flash-VStream Efficient[12], which similarly target real-time performance through efficient state management. Compared to approaches like HourVideo[11] that handle extremely long videos by hierarchical summarization, FlowNar[0] focuses on continuous, single-pass processing, trading off some global coherence for immediate responsiveness. Open questions remain around how to best integrate cross-modal grounding and temporal reasoning within strict streaming constraints, and whether hybrid architectures can reconcile the benefits of both incremental and hierarchical strategies for diverse application scenarios.

Claimed Contributions

FlowNar framework for scalable streaming video narration

The authors introduce FlowNar, a framework that enables scalable streaming video narration through dynamic context management, which removes historical visual context to keep visual memory usage and computational complexity bounded, a requirement for efficient streaming of long-form videos.

10 retrieved papers
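As a rough illustration of the dynamic-context idea claimed above, the sketch below keeps only a fixed number of recent visual segments and evicts older ones, so memory stays bounded no matter how long the stream runs. This is a hypothetical reconstruction, not the authors' implementation; the class and parameter names (`BoundedVisualContext`, `max_segments`) are invented for illustration.

```python
from collections import deque

class BoundedVisualContext:
    """Sketch of dynamic context management: retain at most `max_segments`
    recent visual-feature segments; older segments are dropped, so the
    context size is constant regardless of video duration."""

    def __init__(self, max_segments: int):
        self.max_segments = max_segments
        # deque with maxlen evicts the oldest entry automatically on append
        self.segments = deque(maxlen=max_segments)

    def add_segment(self, features):
        self.segments.append(features)

    def context(self):
        # The narration model would attend only over this bounded window.
        return list(self.segments)
```

For example, with `max_segments=3`, after five segments arrive only the last three remain available to the model.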
Cross Linear Attentive Memory (CLAM) module

The authors propose CLAM, a novel streaming memory module that reformulates linear attention as a visual compressor to iteratively extract and retain relevant visual information from processed segments into a fixed-size set of memory tokens, providing constant memory usage and per-step computational complexity.

10 retrieved papers
Can Refute
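The claim above can be illustrated with a minimal linear-attention memory in NumPy. This is a speculative sketch of what a CLAM-style update might look like, not code from the paper: a fixed set of memory tokens reads each incoming segment through kernelized (linear) cross-attention, so the retained state never grows with video length. The class name, feature map, and initialization are all assumptions.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a common positive feature map for linear attention (an assumption here).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

class LinearAttentiveMemory:
    """Hypothetical CLAM-style memory: `num_mem_tokens` memory tokens read
    each incoming segment via linear cross-attention, so the retained state
    is always (num_mem_tokens, dim) regardless of segment length."""

    def __init__(self, num_mem_tokens, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.mem = rng.standard_normal((num_mem_tokens, dim)) * 0.02

    def update(self, seg_tokens):
        # seg_tokens: (T, dim). The segment is compressed into a (dim, dim)
        # key-value summary, so the stored state is fixed-size: memory usage
        # does not grow with T or with the number of processed segments.
        q = feature_map(self.mem)        # (m, dim) memory-token queries
        k = feature_map(seg_tokens)      # (T, dim)
        v = seg_tokens                   # (T, dim)
        kv = k.T @ v                     # (dim, dim) segment summary
        z = k.sum(axis=0)                # (dim,) attention normalizer
        self.mem = (q @ kv) / (q @ z)[:, None]
        return self.mem
```

The key property the sketch demonstrates is the bounded state: segments of 16 or 100 tokens both leave the memory at the same fixed shape.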
Autoregressive evaluation protocol and complementary metrics

The authors develop a realistic autoregressive evaluation protocol where models condition on their own previously generated narrations rather than ground-truth history, along with a first-align-then-evaluate procedure and new metrics to assess streaming narration performance under deployment-like conditions.

10 retrieved papers
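The protocol described above can be sketched as a simple loop in which the model conditions on its own past outputs rather than ground-truth narrations. This is an illustrative sketch under stated assumptions, not the authors' evaluation code; `model`, `segments`, and `max_history` are invented names, and the alignment and metric steps are omitted.

```python
def autoregressive_narrate(model, segments, max_history=8):
    """Deployment-style evaluation loop: at each step the model sees the
    new video segment plus its OWN previous narrations, never the
    ground-truth history. `model` is any callable
    (segment, history) -> narration string."""
    history, outputs = [], []
    for seg in segments:
        narration = model(seg, history[-max_history:])
        outputs.append(narration)
        history.append(narration)  # condition on generated text, not ground truth
    return outputs
```

Under this protocol, errors can compound across steps, which is precisely the deployment behavior a teacher-forced evaluation (conditioning on ground-truth history) would hide.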

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: FlowNar framework for scalable streaming video narration

Contribution 2: Cross Linear Attentive Memory (CLAM) module

Contribution 3: Autoregressive evaluation protocol and complementary metrics