HiVid: LLM-Guided Video Saliency For Content-Aware VOD And Live Streaming

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Video Streaming, Highlight Detection, Large Language Model, Time Series Forecasting
Abstract:

Content-aware streaming requires dynamic, chunk-level importance weights to optimize subjective quality of experience (QoE). However, direct human annotation is prohibitively expensive, while vision-saliency models generalize poorly. We introduce HiVid, the first framework to leverage Large Language Models (LLMs) as a scalable human proxy to generate high-fidelity weights for both Video-on-Demand (VOD) and live streaming. We address three non-trivial challenges: (1) To extend LLMs' limited modality support and circumvent token limits, we propose a perception module that assesses frames in a local context window, autoregressively building a coherent understanding of the video. (2) For VOD, where ratings are inconsistent across local windows, we propose a ranking module that performs global re-ranking with a novel LLM-guided merge-sort algorithm. (3) For live streaming, which requires low-latency online inference without future knowledge, we propose a prediction module that forecasts future weights with a multi-modal time series model comprising content-aware attention and an adaptive horizon to accommodate asynchronous LLM inference. Extensive experiments show HiVid improves weight-prediction accuracy by up to 11.5% for VOD and 26% for live streaming over SOTA baselines. A real-world user study validates that HiVid boosts streaming QoE correlation by 14.7%.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

HiVid introduces a framework for LLM-guided saliency prediction to generate chunk-level importance weights for content-aware streaming, addressing both VOD and live scenarios. The taxonomy places this work in the 'Saliency-Based Quality-of-Experience Optimization' leaf under 'Content-Aware Adaptive Streaming Optimization'. Notably, this leaf contains only the original paper itself—no sibling papers exist in this specific category. This positioning suggests the work occupies a relatively sparse research direction within the broader field of LLM-guided video streaming, which comprises six total papers across multiple branches.

The taxonomy reveals neighboring research in 'User Behavior-Aware Streaming Control' (one paper on adaptive streaming without saliency modeling) and parallel branches addressing 'Token Management and Compression' (three papers on efficient video-LLM processing) and 'Real-Time and Procedural Video Understanding' (one paper on temporal reasoning). HiVid diverges from these directions by focusing specifically on perceptual importance weighting rather than computational efficiency or user navigation prediction. The scope notes clarify that saliency-based QoE optimization explicitly excludes user jump prediction approaches, positioning HiVid as addressing a distinct problem formulation within content-aware delivery systems.

Among twenty-four candidates examined, none clearly refute the three core contributions. The HiVid framework itself was assessed against four candidates with zero refutations. The three-module architecture (perception, ranking, prediction) was examined against ten candidates, finding no overlapping prior work. The content-aware attention mechanism for multi-modal forecasting similarly showed no clear precedent among ten examined candidates. This absence of refutations across all contributions suggests that within the limited search scope, the combination of LLM-guided saliency prediction, global re-ranking, and low-latency forecasting for streaming appears novel.

The analysis reflects a constrained literature search rather than exhaustive coverage. The taxonomy's sparse population in the target leaf and the absence of refutations among examined candidates indicate potential novelty, but the small search scale (twenty-four papers) and narrow field structure (six total papers) limit definitive conclusions. The work appears to introduce a new problem formulation—using LLMs as human proxies for perceptual importance—that existing compression-focused or behavior-prediction methods do not directly address.

Taxonomy

Core-task Taxonomy Papers: 6
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: LLM-guided video saliency prediction for content-aware streaming.

The field structure reflects a convergence of efficient video-language modeling and adaptive delivery systems. Token Management and Compression for Streaming Video-LLMs addresses the computational bottleneck of processing long videos by developing methods such as Recurrent Token Selection[5] and Hierarchical Token Compression[6] that reduce redundancy while preserving semantic content. Real-Time and Procedural Video Understanding focuses on temporal reasoning and event detection in streaming contexts, exemplified by Streaming Long Video[1] and Streaming VideoLLMs[2]. Multi-Modal Perception and Segmentation tackles fine-grained spatial understanding, with works like Perceive Anything[3] enabling object-level awareness. Content-Aware Adaptive Streaming Optimization integrates these capabilities into delivery frameworks, where systems like JumpDASH[4] leverage content understanding to optimize bandwidth allocation and user experience.

A central tension emerges between computational efficiency and perceptual fidelity: token compression methods enable real-time processing but risk discarding visually salient regions, while exhaustive multi-modal analysis provides richer understanding at prohibitive cost.

HiVid[0] sits within the Content-Aware Adaptive Streaming Optimization branch, specifically targeting saliency-based quality-of-experience optimization. Unlike JumpDASH[4], which focuses on navigation-driven bitrate adaptation, HiVid[0] emphasizes LLM-guided prediction of viewer attention to allocate quality budgets spatially and temporally. This approach contrasts with purely compression-focused methods like Recurrent Token Selection[5], which optimize for semantic retention rather than perceptual importance. The work bridges video-language understanding and streaming delivery, addressing how content semantics can inform adaptive quality decisions in bandwidth-constrained environments.

Claimed Contributions

HiVid framework for LLM-guided content-aware streaming

The authors propose HiVid, a novel framework that uses LLMs to generate chunk-level importance weights for content-aware video streaming. This addresses the trade-off between inaccurate vision-based models and expensive human annotation by using LLMs as a scalable proxy for human judgment in both VOD and live streaming scenarios.

4 retrieved papers
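The framework's core idea of using an LLM as a scoring proxy can be sketched as a loop that rates each local window of frames while carrying a running summary forward. This is a minimal illustration under stated assumptions, not the paper's implementation: `rate_window` and `toy_rater` are hypothetical stand-ins for the actual LLM call.

```python
from typing import Callable, List, Tuple

def perceive(frames: List[str], window: int,
             rate_window: Callable[[str, List[str]], Tuple[str, List[float]]]
             ) -> List[float]:
    """Slide a local context window over frame descriptions; each call
    rates one window and returns an updated running summary, so the model
    builds an autoregressive understanding despite per-call token limits."""
    summary: str = ""
    weights: List[float] = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        summary, scores = rate_window(summary, chunk)
        weights.extend(scores)
    return weights

# Hypothetical stand-in for the LLM: keyword-based scores (illustrative only).
def toy_rater(summary: str, chunk: List[str]) -> Tuple[str, List[float]]:
    scores = [2.0 if "goal" in f else 1.0 for f in chunk]
    return summary + " " + " ".join(chunk), scores

weights = perceive(["warmup", "goal scored", "replay", "crowd"], 2, toy_rater)
```

In a real system the summary string would let later windows be judged relative to earlier content, which is what makes the ratings coherent across the whole video.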
Three-module architecture addressing modality, consistency, and latency challenges

The authors develop three specialized modules to overcome key challenges: a perception module using sliding windows to handle LLM modality and token constraints, a ranking module with LLM-guided merge sort to eliminate rating inconsistencies in VOD, and a prediction module with adaptive forecasting for real-time live streaming without future knowledge.

10 retrieved papers
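The ranking module's LLM-guided merge sort can be illustrated with a standard merge sort whose comparator is delegated to a pairwise judgment: global ordering is recovered from only local, pairwise comparisons. In the real system that judgment would be an LLM query; the `prefer` lambda below is an illustrative stand-in.

```python
from typing import Callable, List

def llm_merge_sort(items: List[str], prefer: Callable[[str, str], bool]) -> List[str]:
    """Merge sort where each merge comparison is delegated to a judgment
    prefer(a, b) (True if a is at least as salient as b). Each comparison
    only ever involves two items, sidestepping cross-window rating drift."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = llm_merge_sort(items[:mid], prefer)
    right = llm_merge_sort(items[mid:], prefer)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if prefer(left[i], right[j]):
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

# Stand-in comparator: longer description = more salient (illustrative only).
ranked = llm_merge_sort(["ad break", "match-winning goal", "replay"],
                        lambda a, b: len(a) >= len(b))
```

Merge sort is a natural fit here because it needs only O(n log n) pairwise queries, keeping the number of LLM calls tractable relative to scoring all pairs.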
Novel content-aware attention mechanism for multi-modal forecasting

The authors introduce a content-aware attention mechanism that captures the interdependent relationships between time series weights and multi-modal video content (frames and text summaries). This mechanism specifically learns how historical video content influences the evolution of time series weights for improved prediction accuracy.

10 retrieved papers
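One way to read "content-aware attention" is as cross-attention in which the embedded weight history queries multi-modal content embeddings, so the forecast of future weights can condition on what was on screen. The NumPy sketch below is an assumption about the general shape of such a mechanism, not the paper's actual architecture; shapes and names are illustrative.

```python
import numpy as np

def content_aware_attention(weights_hist: np.ndarray,
                            content_emb: np.ndarray) -> np.ndarray:
    """Cross-attention sketch: embedded past weights (queries) attend over
    multi-modal content embeddings, e.g. frame + text-summary features
    (keys/values). weights_hist: (T, d); content_emb: (T, d).
    Returns a (T, d) content-conditioned representation."""
    d = weights_hist.shape[-1]
    scores = weights_hist @ content_emb.T / np.sqrt(d)    # (T, T) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)        # rows sum to 1
    return attn @ content_emb

rng = np.random.default_rng(0)
out = content_aware_attention(rng.normal(size=(5, 8)), rng.normal(size=(5, 8)))
```

A forecasting head would then consume this representation together with the raw weight series to predict weights over the adaptive horizon.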

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though that signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

HiVid framework for LLM-guided content-aware streaming

The authors propose HiVid, a novel framework that uses LLMs to generate chunk-level importance weights for content-aware video streaming. This addresses the trade-off between inaccurate vision-based models and expensive human annotation by using LLMs as a scalable proxy for human judgment in both VOD and live streaming scenarios.

Contribution

Three-module architecture addressing modality, consistency, and latency challenges

The authors develop three specialized modules to overcome key challenges: a perception module using sliding windows to handle LLM modality and token constraints, a ranking module with LLM-guided merge sort to eliminate rating inconsistencies in VOD, and a prediction module with adaptive forecasting for real-time live streaming without future knowledge.

Contribution

Novel content-aware attention mechanism for multi-modal forecasting

The authors introduce a content-aware attention mechanism that captures the interdependent relationships between time series weights and multi-modal video content (frames and text summaries). This mechanism specifically learns how historical video content influences the evolution of time series weights for improved prediction accuracy.