StreamingVLM: Real-Time Understanding for Infinite Video Streams

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Machine Learning · Vision Language Model · ML System
Abstract:

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are flawed as well: they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code will be released upon publication.
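The compact-cache policy described in the abstract (keep attention-sink states, a short window of recent vision tokens, and a long window of recent text tokens; evict everything else) can be sketched as follows. This is a minimal illustration with toy window sizes and an assumed entry representation, not the authors' implementation; the paper's contiguous-RoPE repositioning after eviction is omitted here.

```python
def compact_cache(entries, num_sinks=4, vision_window=6, text_window=12):
    """Evict KV-cache entries, keeping: the first `num_sinks` entries
    (attention sinks), the most recent `vision_window` vision entries,
    and the most recent `text_window` text entries, in original order.

    `entries` is a chronological list of (kind, state) pairs with kind
    in {"vision", "text"}; the window sizes here are toy values.
    """
    sinks = entries[:num_sinks]
    rest = entries[num_sinks:]
    vision_idx = [i for i, (kind, _) in enumerate(rest) if kind == "vision"][-vision_window:]
    text_idx = [i for i, (kind, _) in enumerate(rest) if kind == "text"][-text_window:]
    keep = sorted(set(vision_idx) | set(text_idx))
    return sinks + [rest[i] for i in keep]

# Example: 4 sink entries, then 10 vision and 20 text entries arrive.
cache = ([("text", i) for i in range(4)]
         + [("vision", i) for i in range(10)]
         + [("text", 100 + i) for i in range(20)])
cache = compact_cache(cache)  # 4 sinks + 6 vision + 12 text = 22 entries
```

Because eviction keeps the cache bounded regardless of stream length, per-step cost stays constant, which is what enables the stable, real-time throughput the abstract reports.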

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly paper search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

StreamingVLM introduces a unified framework for real-time understanding of infinite video streams by maintaining a compact KV cache with attention sinks, recent vision tokens, and recent text tokens. The paper resides in the 'Attention Sink and Window Methods' leaf under 'Real-Time Streaming Inference Frameworks', which contains only two papers total (StreamingVLM and one sibling). This represents a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the specific combination of attention sinks with streaming video understanding remains an emerging area rather than a crowded subfield.

The taxonomy reveals that StreamingVLM's approach sits at the intersection of multiple research directions. Neighboring leaves include 'Incremental Decoding Systems' (three papers) and 'Proactive and Anticipatory Agents' (two papers), both addressing real-time processing but through different mechanisms. The broader 'Memory-Based Streaming Video Understanding' branch (eight papers across three leaves) offers alternative solutions using explicit memory structures rather than attention-based windowing. StreamingVLM diverges from these memory-centric approaches by relying purely on architectural attention patterns, positioning it as a complementary strategy within the field's technical landscape.

Among twenty-two candidates examined through limited semantic search, none clearly refute any of StreamingVLM's three core contributions. The framework itself was assessed against ten candidates with zero refutable overlaps; the overlapped full-attention fine-tuning strategy was compared against two candidates with no prior equivalent identified; and the Inf-Streams-Eval benchmark was reviewed against ten candidates without finding existing equivalents. These statistics suggest that within the examined scope, the specific combination of attention sink mechanisms, overlapped chunk training, and the proposed benchmark appears distinct from prior work, though the limited search scale means unexplored literature may exist.

The analysis indicates StreamingVLM occupies a relatively novel position within the examined literature, particularly given its sparse taxonomy leaf and absence of refuting candidates among twenty-two papers reviewed. However, this assessment is constrained by the top-K semantic search methodology and does not constitute an exhaustive survey of all related work in video understanding, attention mechanisms, or streaming inference systems.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: real-time understanding of infinite video streams. The field addresses how models can continuously process unbounded video inputs with limited memory and computational resources. The taxonomy reveals several complementary branches:

- Memory-Based Streaming Video Understanding explores techniques for selectively retaining and retrieving salient information from long histories (e.g., Flash-VStream Memory[1], StreamMem[44]).
- Real-Time Streaming Inference Frameworks focuses on efficient architectures and attention mechanisms that enable low-latency processing, including the Attention Sink and Window Methods leaf where StreamingVLM[0] resides, alongside works like Multitask Streaming Representation[25].
- Streaming Video Interaction and Dialogue examines conversational systems that respond to ongoing visual input (e.g., StreamChat[22], VideoChat[9]).
- Training Strategies for Streaming Video investigates how to adapt models for continuous scenarios (Streaming Instruction Tuning[23]).
- Temporal Reasoning and Segmentation handles event boundaries and action detection in streams (Streaming Dense Captioning[5], Temporal Action Segmentation[31]).
- Evaluation Benchmarks and Datasets provide standardized testbeds (H2VU Benchmark[37]).
- Domain-Specific Streaming Applications target specialized use cases such as sign language recognition (Sign Without Segmentation[6]) and egocentric video (Egocentric Action Detection[29]).
- Supporting Technologies and Methods cover foundational techniques such as interpolation and quality-of-experience modeling.

A central tension across these branches involves balancing memory efficiency with the need to retain long-range context for accurate understanding.
Memory-based approaches like Flash-VStream Efficient[2] and StreamMem[44] propose compact representations, while attention-based frameworks such as StreamingVLM[0] and Multitask Streaming Representation[25] leverage windowing and sink mechanisms to manage computational costs without explicit external memory. StreamingVLM[0] sits squarely within the Real-Time Streaming Inference Frameworks branch, emphasizing attention sink and window methods that allow transformers to handle indefinitely long streams by discarding older tokens while preserving critical attention anchors. Compared to neighboring work like Multitask Streaming Representation[25], which addresses multi-task scenarios, StreamingVLM[0] focuses more narrowly on the core challenge of maintaining stable attention patterns over infinite horizons. This contrasts with memory-centric designs (Flash-VStream Memory[1]) that explicitly store and retrieve past frames, highlighting an ongoing debate about whether architectural innovations or external memory structures better serve real-time, unbounded video understanding.

Claimed Contributions

StreamingVLM framework for infinite video understanding

The authors propose StreamingVLM, a unified framework that enables vision-language models to process infinite video streams in real time by aligning training with streaming inference. The framework maintains a compact KV cache using attention sinks, a short vision window, and a long text window, combined with contiguous RoPE to prevent positional drift.

10 retrieved papers
Overlapped full-attention supervised fine-tuning strategy

The authors develop a training strategy that applies full attention on short, overlapped video chunks to mimic the inference-time attention pattern. This approach enables the model to learn streaming behavior without requiring training on extremely long contexts, while maintaining alignment between training and inference.

2 retrieved papers
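The overlapped-chunk layout this contribution describes can be made concrete with a small scheduling helper: short windows that share a few frames with their neighbors, with full attention applied within each window during SFT. This is only a sketch under assumed toy sizes, not the paper's training code.

```python
def overlapped_chunks(num_frames, chunk_len=8, overlap=2):
    """Return (start, end) frame ranges for short, overlapped chunks;
    full attention is applied within each chunk during SFT. The final
    chunk is clipped to the video length. Sizes are toy values."""
    step = chunk_len - overlap
    chunks, start = [], 0
    while start < num_frames:
        chunks.append((start, min(start + chunk_len, num_frames)))
        if start + chunk_len >= num_frames:
            break
        start += step
    return chunks

overlapped_chunks(20)  # [(0, 8), (6, 14), (12, 20)]
```

Each chunk's leading frames repeat the tail of the previous chunk, which is what lets short-context training mimic the sliding attention pattern used at inference time.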
Inf-Streams-Eval benchmark for long-form video understanding

The authors introduce Inf-Streams-Eval, a new benchmark containing videos averaging over two hours in length that requires dense, per-second alignment between visual frames and text. This benchmark is designed to evaluate models on long-form video understanding tasks with real-time commentary requirements.

10 retrieved papers
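Dense per-second alignment, as the benchmark description requires, can be illustrated by mapping each second of video to the commentary active at that time. The caption tuple format below is an assumption for illustration, not the benchmark's actual schema.

```python
def align_per_second(captions, duration):
    """Map each second of a `duration`-second video to the caption
    active at that second; uncovered seconds map to None.

    `captions` is a list of (start_sec, end_sec, text) with half-open
    [start, end) intervals -- an assumed format for illustration."""
    timeline = [None] * duration
    for start, end, text in captions:
        for sec in range(max(start, 0), min(end, duration)):
            timeline[sec] = text
    return timeline

align_per_second([(0, 2, "kickoff"), (2, 5, "pass")], 6)
# -> ["kickoff", "kickoff", "pass", "pass", "pass", None]
```

A per-second timeline like this makes it possible to score whether a model's streamed commentary stays aligned with the frames it is currently seeing, rather than drifting ahead or behind.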

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
