StreamingVLM: Real-Time Understanding for Infinite Video Streams

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Machine Learning · Vision Language Model · ML System
Abstract:

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are flawed as well: they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code will be released upon publication.
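The compact-cache policy described in the abstract (keep attention-sink states, a short window of recent vision tokens, and a long window of recent text tokens; evict everything else) can be sketched as follows. This is a minimal illustration with toy window sizes and an assumed entry representation, not the authors' implementation; the paper's contiguous-RoPE repositioning after eviction is omitted here.

```python
def compact_cache(entries, num_sinks=4, vision_window=6, text_window=12):
    """Evict KV-cache entries, keeping: the first `num_sinks` entries
    (attention sinks), the most recent `vision_window` vision entries,
    and the most recent `text_window` text entries, in original order.

    `entries` is a chronological list of (kind, state) pairs with kind
    in {"vision", "text"}; the window sizes here are toy values.
    """
    sinks = entries[:num_sinks]
    rest = entries[num_sinks:]
    vision_idx = [i for i, (kind, _) in enumerate(rest) if kind == "vision"][-vision_window:]
    text_idx = [i for i, (kind, _) in enumerate(rest) if kind == "text"][-text_window:]
    keep = sorted(set(vision_idx) | set(text_idx))
    return sinks + [rest[i] for i in keep]

# Example: 4 sink entries, then 10 vision and 20 text entries arrive.
cache = ([("text", i) for i in range(4)]
         + [("vision", i) for i in range(10)]
         + [("text", 100 + i) for i in range(20)])
cache = compact_cache(cache)  # 4 sinks + 6 vision + 12 text = 22 entries
```

Because eviction keeps the cache bounded regardless of stream length, per-step cost stays constant, which is what enables the stable, real-time throughput the abstract reports.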

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly paper search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

StreamingVLM introduces a unified framework for real-time understanding of infinite video streams by maintaining a compact KV cache with attention sinks, recent vision tokens, and recent text tokens. The paper resides in the 'Attention Sink and Window Methods' leaf under 'Real-Time Streaming Inference Frameworks', which contains only two papers total (StreamingVLM and one sibling). This represents a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the specific combination of attention sinks with streaming video understanding remains an emerging area rather than a crowded subfield.

The taxonomy reveals that StreamingVLM's approach sits at the intersection of multiple research directions. Neighboring leaves include 'Incremental Decoding Systems' (three papers) and 'Proactive and Anticipatory Agents' (two papers), both addressing real-time processing but through different mechanisms. The broader 'Memory-Based Streaming Video Understanding' branch (eight papers across three leaves) offers alternative solutions using explicit memory structures rather than attention-based windowing. StreamingVLM diverges from these memory-centric approaches by relying purely on architectural attention patterns, positioning it as a complementary strategy within the field's technical landscape.

Among twenty-two candidates examined through limited semantic search, none clearly refute any of StreamingVLM's three core contributions. The framework itself was assessed against ten candidates with zero refutable overlaps; the overlapped full-attention fine-tuning strategy was compared against two candidates with no prior equivalent identified; and the Inf-Streams-Eval benchmark was reviewed against ten candidates without finding existing equivalents. These statistics suggest that within the examined scope, the specific combination of attention sink mechanisms, overlapped chunk training, and the proposed benchmark appears distinct from prior work, though the limited search scale means unexplored literature may exist.

The analysis indicates StreamingVLM occupies a relatively novel position within the examined literature, particularly given its sparse taxonomy leaf and absence of refuting candidates among twenty-two papers reviewed. However, this assessment is constrained by the top-K semantic search methodology and does not constitute an exhaustive survey of all related work in video understanding, attention mechanisms, or streaming inference systems.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: real-time understanding of infinite video streams. The field addresses how models can continuously process unbounded video inputs with limited memory and computational resources. The taxonomy reveals several complementary branches:

- Memory-Based Streaming Video Understanding explores techniques for selectively retaining and retrieving salient information from long histories (e.g., Flash-VStream Memory[1], StreamMem[44]).
- Real-Time Streaming Inference Frameworks focuses on efficient architectures and attention mechanisms that enable low-latency processing, including the Attention Sink and Window Methods leaf where StreamingVLM[0] resides, alongside works like Multitask Streaming Representation[25].
- Streaming Video Interaction and Dialogue examines conversational systems that respond to ongoing visual input (e.g., StreamChat[22], VideoChat[9]).
- Training Strategies for Streaming Video investigates how to adapt models for continuous scenarios (Streaming Instruction Tuning[23]).
- Temporal Reasoning and Segmentation handles event boundaries and action detection in streams (Streaming Dense Captioning[5], Temporal Action Segmentation[31]).
- Evaluation Benchmarks and Datasets provide standardized testbeds (H2VU Benchmark[37]).
- Domain-Specific Streaming Applications target specialized use cases such as sign language recognition (Sign Without Segmentation[6]) and egocentric video (Egocentric Action Detection[29]).
- Supporting Technologies and Methods cover foundational techniques such as interpolation and quality-of-experience modeling.

A central tension across these branches involves balancing memory efficiency with the need to retain long-range context for accurate understanding.
Memory-based approaches like Flash-VStream Efficient[2] and StreamMem[44] propose compact representations, while attention-based frameworks such as StreamingVLM[0] and Multitask Streaming Representation[25] leverage windowing and sink mechanisms to manage computational costs without explicit external memory. StreamingVLM[0] sits squarely within the Real-Time Streaming Inference Frameworks branch, emphasizing attention sink and window methods that allow transformers to handle indefinitely long streams by discarding older tokens while preserving critical attention anchors. Compared to neighboring work like Multitask Streaming Representation[25], which addresses multi-task scenarios, StreamingVLM[0] focuses more narrowly on the core challenge of maintaining stable attention patterns over infinite horizons. This contrasts with memory-centric designs (Flash-VStream Memory[1]) that explicitly store and retrieve past frames, highlighting an ongoing debate about whether architectural innovations or external memory structures better serve real-time, unbounded video understanding.

Claimed Contributions

StreamingVLM framework for infinite video understanding

The authors propose StreamingVLM, a unified framework that enables vision-language models to process infinite video streams in real time by aligning training with streaming inference. The framework maintains a compact KV cache using attention sinks, a short vision window, and a long text window, combined with contiguous RoPE to prevent positional drift.

10 retrieved papers
Overlapped full-attention supervised fine-tuning strategy

The authors develop a training strategy that applies full attention on short, overlapped video chunks to mimic the inference-time attention pattern. This approach enables the model to learn streaming behavior without requiring training on extremely long contexts, while maintaining alignment between training and inference.

2 retrieved papers
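The overlapped-chunk layout this contribution describes can be made concrete with a small scheduling helper: short windows that share a few frames with their neighbors, with full attention applied within each window during SFT. This is only a sketch under assumed toy sizes, not the paper's training code.

```python
def overlapped_chunks(num_frames, chunk_len=8, overlap=2):
    """Return (start, end) frame ranges for short, overlapped chunks;
    full attention is applied within each chunk during SFT. The final
    chunk is clipped to the video length. Sizes are toy values."""
    step = chunk_len - overlap
    chunks, start = [], 0
    while start < num_frames:
        chunks.append((start, min(start + chunk_len, num_frames)))
        if start + chunk_len >= num_frames:
            break
        start += step
    return chunks

overlapped_chunks(20)  # [(0, 8), (6, 14), (12, 20)]
```

Each chunk's leading frames repeat the tail of the previous chunk, which is what lets short-context training mimic the sliding attention pattern used at inference time.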
Inf-Streams-Eval benchmark for long-form video understanding

The authors introduce Inf-Streams-Eval, a new benchmark containing videos averaging over two hours in length that requires dense, per-second alignment between visual frames and text. This benchmark is designed to evaluate models on long-form video understanding tasks with real-time commentary requirements.

10 retrieved papers
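Dense per-second alignment, as the benchmark description requires, can be illustrated by mapping each second of video to the commentary active at that time. The caption tuple format below is an assumption for illustration, not the benchmark's actual schema.

```python
def align_per_second(captions, duration):
    """Map each second of a `duration`-second video to the caption
    active at that second; uncovered seconds map to None.

    `captions` is a list of (start_sec, end_sec, text) with half-open
    [start, end) intervals -- an assumed format for illustration."""
    timeline = [None] * duration
    for start, end, text in captions:
        for sec in range(max(start, 0), min(end, duration)):
            timeline[sec] = text
    return timeline

align_per_second([(0, 2, "kickoff"), (2, 5, "pass")], 6)
# -> ["kickoff", "kickoff", "pass", "pass", "pass", None]
```

A per-second timeline like this makes it possible to score whether a model's streamed commentary stays aligned with the frames it is currently seeing, rather than drifting ahead or behind.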

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
