FlowNar: Scalable Streaming Narration for Long-Form Videos

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: streaming video narration, vision language models, long-form video understanding, cross linear attentive memory
Abstract:

Recent Large Multimodal Models (LMMs), primarily designed for offline settings, are ill-suited for the dynamic requirements of streaming video. While recent online adaptations improve real-time processing, they still face critical scalability challenges, with resource demands typically growing at least linearly with video duration. To overcome this bottleneck, we propose FlowNar, a novel framework for scalable streaming video narration. The core of FlowNar is a dynamic context management strategy that removes historical visual context, combined with our novel Cross Linear Attentive Memory (CLAM) module for retaining streaming visual history, ensuring bounded visual memory usage and computational complexity, both crucial for efficient streaming. We also introduce a realistic autoregressive evaluation protocol and complementary evaluation metrics to assess streaming narration models under deployment-like conditions. Experiments on the Ego4D, EgoExo4D, and EpicKitchens100 datasets demonstrate that FlowNar substantially improves narration quality over strong baselines while being highly efficient, supporting 10× longer videos and achieving 3× higher throughput (FPS).

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), so the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, and human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

FlowNar proposes a memory-efficient streaming framework for long-form video narration, introducing dynamic context management and the CLAM module to maintain bounded visual memory usage. The paper resides in the Memory-Efficient Streaming Frameworks leaf, which contains only three papers total, including FlowNar itself. This represents a relatively sparse research direction within the broader taxonomy of streaming video narration, suggesting the specific focus on bounded memory and scalable streaming architectures is not yet densely populated. The sibling papers Flash-VStream and Flash-VStream Efficient share similar goals of real-time processing with constrained resources.

The taxonomy reveals that FlowNar's leaf sits within the Streaming and Real-Time Video Processing Architectures branch, which also includes Online Dense Captioning with Temporal Localization (four papers) and Interactive Real-Time Video Understanding (one paper). Neighboring branches address Long-Form Video Understanding through hierarchical methods and Computational Efficiency via token reduction or state-space models. FlowNar diverges from hierarchical approaches by emphasizing single-pass incremental processing rather than multi-level aggregation, and from pure efficiency techniques by integrating memory management directly into the streaming architecture rather than applying post-hoc optimizations.

Of the thirty candidate papers examined (ten per contribution), only the CLAM module shows a refutable candidate (one of its ten), indicating some overlap with prior memory mechanisms in streaming contexts. The FlowNar framework itself and the autoregressive evaluation protocol each had zero refutations among their ten candidates, suggesting these contributions address gaps less directly covered by the limited search scope. The framework's emphasis on dynamic context removal appears more distinctive than the memory module design, though the modest search scale means substantial related work may exist beyond the top thirty semantic matches retrieved.

Given the sparse population of the Memory-Efficient Streaming Frameworks leaf and the limited refutation rate across contributions, FlowNar appears to occupy a relatively underexplored niche within streaming video narration. However, the analysis covers only thirty candidates from semantic search, leaving open the possibility that relevant work exists in adjacent areas such as efficient video encoders or temporal modeling techniques not captured by the search strategy.

Taxonomy

- 24 Core-task Taxonomy Papers
- 3 Claimed Contributions
- 30 Contribution Candidate Papers Compared
- 1 Refutable Paper

Research Landscape Overview

Core task: streaming video narration for long-form videos. The field addresses the challenge of generating natural-language descriptions for extended video content in real time or near-real time, requiring systems that can process continuous streams efficiently while maintaining coherent narrative structure. The taxonomy organizes research into six main branches:

- Streaming and Real-Time Video Processing Architectures: frameworks that handle incoming frames with minimal latency and bounded memory.
- Long-Form Video Understanding and Hierarchical Captioning: summarizing or segmenting hours of footage into meaningful narrative units.
- Video-Language Representation Learning and Pretraining: foundational models that align visual and textual modalities.
- Computational Efficiency and Scalability Techniques: methods that reduce inference cost and enable deployment at scale.
- Cross-Modal Alignment and Grounding: ensuring that generated text accurately reflects temporal events and spatial regions in the video.
- Application-Specific Video Captioning Systems: solutions tailored to domains such as instructional content, live events, or accessibility services.

Representative works like Flash-VStream[10] and HourVideo[11] illustrate how memory-efficient streaming frameworks balance throughput with narrative quality, while methods such as Streaming Dense Captioning[3] and MM-Narrator[16] demonstrate hierarchical approaches to long-form understanding. Several active lines of work reveal key trade-offs between latency, memory footprint, and caption richness. Memory-efficient streaming frameworks prioritize bounded state and incremental processing, enabling real-time operation on resource-constrained devices, whereas hierarchical captioning methods often require multiple passes or segment-level aggregation to produce coherent long-form narratives.
FlowNar[0] sits within the Memory-Efficient Streaming Frameworks cluster, emphasizing low-latency narration with constrained memory usage, closely aligned with Flash-VStream[10] and Flash-VStream Efficient[12], which similarly target real-time performance through efficient state management. Compared to approaches like HourVideo[11] that handle extremely long videos by hierarchical summarization, FlowNar[0] focuses on continuous, single-pass processing, trading off some global coherence for immediate responsiveness. Open questions remain around how to best integrate cross-modal grounding and temporal reasoning within strict streaming constraints, and whether hybrid architectures can reconcile the benefits of both incremental and hierarchical strategies for diverse application scenarios.

Claimed Contributions

FlowNar framework for scalable streaming video narration

The authors introduce FlowNar, a framework that enables scalable streaming video narration through dynamic context management, which removes historical visual context to keep visual memory usage and computational complexity bounded, a requirement for efficient streaming of long-form videos.

10 retrieved papers
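As a rough illustration of the dynamic-context idea claimed above, the sketch below keeps only a fixed number of recent visual segments and evicts older ones, so memory stays bounded no matter how long the stream runs. This is a hypothetical reconstruction, not the authors' implementation; the class and parameter names (`BoundedVisualContext`, `max_segments`) are invented for illustration.

```python
from collections import deque

class BoundedVisualContext:
    """Sketch of dynamic context management: retain at most `max_segments`
    recent visual-feature segments; older segments are dropped, so the
    context size is constant regardless of video duration."""

    def __init__(self, max_segments: int):
        self.max_segments = max_segments
        # deque with maxlen evicts the oldest entry automatically on append
        self.segments = deque(maxlen=max_segments)

    def add_segment(self, features):
        self.segments.append(features)

    def context(self):
        # The narration model would attend only over this bounded window.
        return list(self.segments)
```

For example, with `max_segments=3`, after five segments arrive only the last three remain available to the model.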
Cross Linear Attentive Memory (CLAM) module

The authors propose CLAM, a novel streaming memory module that reformulates linear attention as a visual compressor to iteratively extract and retain relevant visual information from processed segments into a fixed-size set of memory tokens, providing constant memory usage and per-step computational complexity.

10 retrieved papers
Can Refute
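The claim above can be illustrated with a minimal linear-attention memory in NumPy. This is a speculative sketch of what a CLAM-style update might look like, not code from the paper: a fixed set of memory tokens reads each incoming segment through kernelized (linear) cross-attention, so the retained state never grows with video length. The class name, feature map, and initialization are all assumptions.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a common positive feature map for linear attention (an assumption here).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

class LinearAttentiveMemory:
    """Hypothetical CLAM-style memory: `num_mem_tokens` memory tokens read
    each incoming segment via linear cross-attention, so the retained state
    is always (num_mem_tokens, dim) regardless of segment length."""

    def __init__(self, num_mem_tokens, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.mem = rng.standard_normal((num_mem_tokens, dim)) * 0.02

    def update(self, seg_tokens):
        # seg_tokens: (T, dim). The segment is compressed into a (dim, dim)
        # key-value summary, so the stored state is fixed-size: memory usage
        # does not grow with T or with the number of processed segments.
        q = feature_map(self.mem)        # (m, dim) memory-token queries
        k = feature_map(seg_tokens)      # (T, dim)
        v = seg_tokens                   # (T, dim)
        kv = k.T @ v                     # (dim, dim) segment summary
        z = k.sum(axis=0)                # (dim,) attention normalizer
        self.mem = (q @ kv) / (q @ z)[:, None]
        return self.mem
```

The key property the sketch demonstrates is the bounded state: segments of 16 or 100 tokens both leave the memory at the same fixed shape.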
Autoregressive evaluation protocol and complementary metrics

The authors develop a realistic autoregressive evaluation protocol where models condition on their own previously generated narrations rather than ground-truth history, along with a first-align-then-evaluate procedure and new metrics to assess streaming narration performance under deployment-like conditions.

10 retrieved papers
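The protocol described above can be sketched as a simple loop in which the model conditions on its own past outputs rather than ground-truth narrations. This is an illustrative sketch under stated assumptions, not the authors' evaluation code; `model`, `segments`, and `max_history` are invented names, and the alignment and metric steps are omitted.

```python
def autoregressive_narrate(model, segments, max_history=8):
    """Deployment-style evaluation loop: at each step the model sees the
    new video segment plus its OWN previous narrations, never the
    ground-truth history. `model` is any callable
    (segment, history) -> narration string."""
    history, outputs = [], []
    for seg in segments:
        narration = model(seg, history[-max_history:])
        outputs.append(narration)
        history.append(narration)  # condition on generated text, not ground truth
    return outputs
```

Under this protocol, errors can compound across steps, which is precisely the deployment behavior a teacher-forced evaluation (conditioning on ground-truth history) would hide.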

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: FlowNar framework for scalable streaming video narration

Contribution 2: Cross Linear Attentive Memory (CLAM) module

Contribution 3: Autoregressive evaluation protocol and complementary metrics