VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: long video understanding, video language model
Abstract:

Long-context video modeling is critical for multimodal large language models (MLLMs), enabling them to process movies, online video streams, and more. Despite recent advances, handling long videos remains challenging due to the difficulty of efficiently understanding extremely long video contexts. This paper addresses the issue from four aspects: model architecture, training data, training strategy, and evaluation benchmark. First, we propose a novel Hierarchical video token Compression (HiCo) method, which exploits visual redundancy in long videos to compress the video context from the clip level to the video level, significantly reducing computation while preserving essential details and achieving an extreme compression ratio of approximately 1/50 with almost no performance loss. Second, we introduce a multi-stage short-to-long learning scheme, a large-scale dataset of real-world long videos named LongVid, and a challenging "Multi-Hop Needle-In-A-Video-Haystack" benchmark. Finally, we build a powerful video MLLM named VideoChat-Flash, which shows leading performance on mainstream long and short video benchmarks at both the 2B and 7B model scales. It is the first open-source model to reach 99.1% accuracy on NIAH over 10,000 frames.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a hierarchical video token compression method (HiCo) achieving approximately 1/50 compression ratio, alongside a multi-stage short-to-long training scheme, the LongVid dataset, and a Multi-Hop Needle-In-A-Video-Haystack benchmark. It resides in the 'Hierarchical and Adaptive Compression' leaf under 'Token Compression and Efficiency Mechanisms', sharing this space with three sibling papers (LongVU, Video-XL, and one other). This leaf represents a moderately populated research direction within a broader taxonomy of 50 papers across approximately 36 topics, indicating active but not overcrowded exploration of adaptive compression strategies for long-context video understanding.

The taxonomy reveals neighboring leaves focused on spatiotemporal token reduction via dedicated temporal encoders, streaming architectures with constant token budgets, and slow-fast dual-pathway designs. These adjacent directions emphasize different trade-offs: spatiotemporal methods prioritize inter-frame dependency modeling, while streaming approaches target online processing constraints. The paper's hierarchical compression strategy bridges efficiency-driven token reduction and reasoning-oriented temporal understanding, contrasting with fixed-rule compression or single-pathway methods. Its position suggests engagement with both compression efficiency and preservation of semantic detail across extended video timelines, distinguishing it from purely architectural or temporal grounding approaches in sibling branches.

Among 30 candidates examined, the HiCo compression method shows no clear refutation across 10 candidates, suggesting relative novelty in its specific hierarchical design. The Multi-Hop Needle benchmark similarly appears unrefuted across 10 candidates, indicating potential originality in evaluation methodology. However, the LongVid dataset and short-to-long learning strategy encountered one refutable candidate among 10 examined, pointing to existing work in progressive training or large-scale long-video data curation. The limited search scope means these findings reflect top-30 semantic matches rather than exhaustive coverage, leaving open the possibility of additional relevant prior work in less semantically similar papers.

Given the moderate density of the hierarchical compression leaf and the mixed contribution-level findings, the work appears to offer incremental advances in compression architecture and benchmarking while building on established paradigms in multi-stage training and dataset construction. The analysis is constrained by the top-30 candidate scope and does not capture potential overlaps in broader compression literature or domain-specific long-video datasets outside the semantic search radius.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: long-context video modeling in multimodal large language models. The field addresses how to enable large language models to process and reason over extended video sequences, often spanning minutes or hours, by integrating visual encoders with language backbones. The taxonomy reveals several interrelated branches: Token Compression and Efficiency Mechanisms focus on reducing the computational burden of dense frame representations through hierarchical or adaptive strategies (e.g., VideoChat-Flash[0], LongVU[23]); Temporal Reasoning and Grounding emphasize understanding event sequences and temporal relationships (e.g., TemporalBench[14], TimeMarker[25]); Long-Context Extension and Scalability explore architectural modifications to handle longer inputs (e.g., LongViLA[20], Infinite Video[22]); while Multimodal Integration and Cross-Modal Reasoning tackle the fusion of vision, language, and sometimes audio (e.g., Watch and Listen[21]). Additional branches cover training strategies, benchmarking efforts like Video-MME[8] and LongVideoBench[16], foundation model pretraining (e.g., Internvideo[9]), domain-specific applications such as robotics (RoboVQA[48]) and egocentric video (MM-Ego[40]), and architectural enhancements that refine encoder designs.

A particularly active line of work centers on hierarchical and adaptive compression, where methods dynamically allocate tokens based on content importance or temporal structure. VideoChat-Flash[0] exemplifies this approach by employing adaptive compression to balance efficiency and detail retention, closely aligning with LongVU[23] and Video-XL[30], which similarly pursue token-efficient representations for long videos. In contrast, works like Slow-Fast Architecture[3] and Token-Efficient Long Video[2] explore dual-rate processing or explicit token budgeting to manage computational costs.
Another contrasting theme emerges in temporal grounding: some studies prioritize fine-grained event localization (TemporalBench Fine-Grained[18], VideoRefer Suite[13]), while others focus on holistic narrative understanding across extended timelines (Understanding Long Videos[5], HourVideo[38]). VideoChat-Flash[0] sits within the compression-focused cluster, sharing design principles with LongVU[23] and Video-XL[30], yet its hierarchical strategy distinguishes it from simpler uniform sampling or fixed-rate approaches, positioning it as a bridge between efficiency-driven and reasoning-oriented paradigms.

Claimed Contributions

Hierarchical video token Compression (HiCo) method

A two-stage compression approach that first reduces inter-frame redundancy at the clip level using spatio-temporal attention and token merging, then performs video-level compression by discarding task-irrelevant tokens during LLM processing. This achieves approximately 1/50 compression ratio with minimal performance loss.

10 retrieved papers
LongVid dataset and short-to-long learning strategy

A large-scale training corpus containing 114,228 long videos with 3,444,849 question-answering pairs covering five task types, combined with a multi-stage training strategy that progresses from image and short video data to joint short and long video instruction tuning.

10 retrieved papers
Can Refute
Multi-Hop Needle-In-A-Video-Haystack benchmark

A new evaluation benchmark that requires models to follow a reasoning path through multiple images inserted into long videos, with wrong paths as distractors. This tests both retrieval and complex reasoning abilities more robustly than previous single-hop approaches.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Hierarchical video token Compression (HiCo) method

A two-stage compression approach that first reduces inter-frame redundancy at the clip level using spatio-temporal attention and token merging, then performs video-level compression by discarding task-irrelevant tokens during LLM processing. This achieves approximately 1/50 compression ratio with minimal performance loss.
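The two-stage design described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`compress_clip`, `video_level_select`), the cosine-similarity threshold, and the query-scored token budget are hypothetical stand-ins, not the paper's actual mechanism, which operates inside the model's attention layers.

```python
import numpy as np

def compress_clip(tokens, sim_thresh=0.9):
    """Clip-level compression (sketch): drop a token when it is nearly
    identical to the token at the same spatial position in the previous
    frame, exploiting inter-frame redundancy.
    tokens: (T, N, D) array of per-frame visual tokens."""
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    kept = [tokens[0]]                              # keep the first frame intact
    for t in range(1, len(tokens)):
        sim = (unit[t] * unit[t - 1]).sum(axis=-1)  # per-position cosine similarity
        kept.append(tokens[t][sim < sim_thresh])    # keep only tokens that changed
    return np.concatenate(kept, axis=0)

def video_level_select(tokens, query, budget):
    """Video-level compression (sketch): retain the `budget` tokens most
    relevant to a query embedding, standing in for the task-guided
    discarding of irrelevant tokens during LLM processing."""
    scores = tokens @ query                         # (M,) relevance scores
    top = np.sort(np.argsort(scores)[-budget:])     # keep temporal order
    return tokens[top]
```

On a clip of four identical frames, `compress_clip` keeps only the first frame's tokens; combining both stages on long inputs is what yields the aggressive overall reduction the contribution claims.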

Contribution

LongVid dataset and short-to-long learning strategy

A large-scale training corpus containing 114,228 long videos with 3,444,849 question-answering pairs covering five task types, combined with a multi-stage training strategy that progresses from image and short video data to joint short and long video instruction tuning.
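The progressive scheme above can be sketched as a staged curriculum. The stage names, data mixes, and frame budgets below are illustrative assumptions, not the paper's actual recipe; the point is only that each stage inherits weights from the last while the visible context grows.

```python
# Hypothetical stage definitions for a short-to-long curriculum;
# values are assumptions for illustration, not the paper's settings.
STAGES = [
    {"name": "alignment",       "data": ["image"],                     "max_frames": 4},
    {"name": "short_video_sft", "data": ["image", "short_video"],      "max_frames": 64},
    {"name": "joint_long_sft",  "data": ["short_video", "long_video"], "max_frames": 512},
]

def run_curriculum(train_stage, stages=STAGES):
    """Run each stage in order; each stage initializes from the previous
    stage's weights, progressively extending the frame budget."""
    history = []
    for cfg in stages:
        train_stage(cfg)            # caller-supplied training routine
        history.append(cfg["name"])
    return history
```

The design choice this mirrors is curriculum ordering: cheap short-context stages establish alignment before expensive long-video instruction tuning is mixed in.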

Contribution

Multi-Hop Needle-In-A-Video-Haystack benchmark

A new evaluation benchmark that requires models to follow a reasoning path through multiple images inserted into long videos, with wrong paths as distractors. This tests both retrieval and complex reasoning abilities more robustly than previous single-hop approaches.
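Sample construction for such a benchmark can be sketched as below. The helper name `build_multihop_sample` and the needle labels are hypothetical; the sketch only captures the stated idea of scattering a correct reasoning chain plus wrong-path distractors across a long frame sequence.

```python
import random

def build_multihop_sample(num_frames, chain, distractors, seed=0):
    """Scatter a chain of needle images (the correct reasoning path) and
    distractor needles (wrong paths) at distinct frame positions in a
    long video. Returns each needle's frame index and the answer needle."""
    rng = random.Random(seed)
    needles = list(chain) + list(distractors)
    positions = rng.sample(range(num_frames), len(needles))  # no collisions
    placement = dict(zip(needles, positions))
    # A model must follow the chain hop by hop; the final hop is the answer.
    return placement, chain[-1]
```

Unlike single-hop NIAH, scoring here requires the model to locate each intermediate clue in order, so a lucky retrieval of the final needle alone does not certify the reasoning path.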
