VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: long video understanding, video language model
Abstract:

Long-context video modeling is critical for multimodal large language models (MLLMs), enabling them to process movies, online video streams, and more. Despite recent advances, handling long videos remains challenging due to the difficulty of efficiently understanding extremely long video contexts. This paper addresses the issue from four aspects: model architecture, training data, training strategy, and evaluation benchmark. First, we propose a novel Hierarchical video token Compression (HiCo) method, which exploits visual redundancy in long videos to compress the video context from the clip level to the video level, significantly reducing computation while preserving essential details and achieving an extreme compression ratio of approximately 1/50 with almost no performance loss. Second, we introduce a multi-stage short-to-long learning scheme, a large-scale dataset of real-world long videos named LongVid, and a challenging "Multi-Hop Needle-In-A-Video-Haystack" benchmark. Finally, we build a powerful video MLLM named VideoChat-Flash, which shows leading performance on mainstream long and short video benchmarks at both the 2B and 7B model scales. It is the first open-source model to reach 99.1% accuracy on NIAH over 10,000 frames.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a hierarchical video token compression method (HiCo) achieving approximately 1/50 compression ratio, alongside a multi-stage short-to-long training scheme, the LongVid dataset, and a Multi-Hop Needle-In-A-Video-Haystack benchmark. It resides in the 'Hierarchical and Adaptive Compression' leaf under 'Token Compression and Efficiency Mechanisms', sharing this space with three sibling papers (LongVU, Video-XL, and one other). This leaf represents a moderately populated research direction within a broader taxonomy of 50 papers across approximately 36 topics, indicating active but not overcrowded exploration of adaptive compression strategies for long-context video understanding.

The taxonomy reveals neighboring leaves focused on spatiotemporal token reduction via dedicated temporal encoders, streaming architectures with constant token budgets, and slow-fast dual-pathway designs. These adjacent directions emphasize different trade-offs: spatiotemporal methods prioritize inter-frame dependency modeling, while streaming approaches target online processing constraints. The paper's hierarchical compression strategy bridges efficiency-driven token reduction and reasoning-oriented temporal understanding, contrasting with fixed-rule compression or single-pathway methods. Its position suggests engagement with both compression efficiency and preservation of semantic detail across extended video timelines, distinguishing it from purely architectural or temporal grounding approaches in sibling branches.

Among 30 candidates examined, the HiCo compression method shows no clear refutation across 10 candidates, suggesting relative novelty in its specific hierarchical design. The Multi-Hop Needle benchmark similarly appears unrefuted across 10 candidates, indicating potential originality in evaluation methodology. However, the LongVid dataset and short-to-long learning strategy encountered one refutable candidate among 10 examined, pointing to existing work in progressive training or large-scale long-video data curation. The limited search scope means these findings reflect top-30 semantic matches rather than exhaustive coverage, leaving open the possibility of additional relevant prior work in less semantically similar papers.

Given the moderate density of the hierarchical compression leaf and the mixed contribution-level findings, the work appears to offer incremental advances in compression architecture and benchmarking while building on established paradigms in multi-stage training and dataset construction. The analysis is constrained by the top-30 candidate scope and does not capture potential overlaps in broader compression literature or domain-specific long-video datasets outside the semantic search radius.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: long-context video modeling in multimodal large language models. The field addresses how to enable large language models to process and reason over extended video sequences, often spanning minutes or hours, by integrating visual encoders with language backbones. The taxonomy reveals several interrelated branches: Token Compression and Efficiency Mechanisms focus on reducing the computational burden of dense frame representations through hierarchical or adaptive strategies (e.g., VideoChat-Flash[0], LongVU[23]); Temporal Reasoning and Grounding emphasize understanding event sequences and temporal relationships (e.g., TemporalBench[14], TimeMarker[25]); Long-Context Extension and Scalability explore architectural modifications to handle longer inputs (e.g., LongViLA[20], Infinite Video[22]); while Multimodal Integration and Cross-Modal Reasoning tackle the fusion of vision, language, and sometimes audio (e.g., Watch and Listen[21]). Additional branches cover training strategies, benchmarking efforts like Video-MME[8] and LongVideoBench[16], foundation model pretraining (e.g., Internvideo[9]), domain-specific applications such as robotics (RoboVQA[48]) and egocentric video (MM-Ego[40]), and architectural enhancements that refine encoder designs.

A particularly active line of work centers on hierarchical and adaptive compression, where methods dynamically allocate tokens based on content importance or temporal structure. VideoChat-Flash[0] exemplifies this approach by employing adaptive compression to balance efficiency and detail retention, closely aligning with LongVU[23] and Video-XL[30], which similarly pursue token-efficient representations for long videos. In contrast, works like Slow-Fast Architecture[3] and Token-Efficient Long Video[2] explore dual-rate processing or explicit token budgeting to manage computational costs.
Another contrasting theme emerges in temporal grounding: some studies prioritize fine-grained event localization (TemporalBench Fine-Grained[18], VideoRefer Suite[13]), while others focus on holistic narrative understanding across extended timelines (Understanding Long Videos[5], HourVideo[38]). VideoChat-Flash[0] sits within the compression-focused cluster, sharing design principles with LongVU[23] and Video-XL[30], yet its hierarchical strategy distinguishes it from simpler uniform sampling or fixed-rate approaches, positioning it as a bridge between efficiency-driven and reasoning-oriented paradigms.

Claimed Contributions

Hierarchical video token Compression (HiCo) method

A two-stage compression approach that first reduces inter-frame redundancy at the clip level using spatio-temporal attention and token merging, then performs video-level compression by discarding task-irrelevant tokens during LLM processing. This achieves approximately 1/50 compression ratio with minimal performance loss.

10 retrieved papers
LongVid dataset and short-to-long learning strategy

A large-scale training corpus containing 114,228 long videos with 3,444,849 question-answering pairs covering five task types, combined with a multi-stage training strategy that progresses from image and short video data to joint short and long video instruction tuning.

10 retrieved papers
Can Refute
Multi-Hop Needle-In-A-Video-Haystack benchmark

A new evaluation benchmark that requires models to follow a reasoning path through multiple images inserted into long videos, with wrong paths as distractors. This tests both retrieval and complex reasoning abilities more robustly than previous single-hop approaches.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Hierarchical video token Compression (HiCo) method

A two-stage compression approach that first reduces inter-frame redundancy at the clip level using spatio-temporal attention and token merging, then performs video-level compression by discarding task-irrelevant tokens during LLM processing. This achieves approximately 1/50 compression ratio with minimal performance loss.
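The two-stage design described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`compress_clip`, `video_level_select`), the cosine-similarity threshold, and the query-scored token budget are hypothetical stand-ins, not the paper's actual mechanism, which operates inside the model's attention layers.

```python
import numpy as np

def compress_clip(tokens, sim_thresh=0.9):
    """Clip-level compression (sketch): drop a token when it is nearly
    identical to the token at the same spatial position in the previous
    frame, exploiting inter-frame redundancy.
    tokens: (T, N, D) array of per-frame visual tokens."""
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    kept = [tokens[0]]                              # keep the first frame intact
    for t in range(1, len(tokens)):
        sim = (unit[t] * unit[t - 1]).sum(axis=-1)  # per-position cosine similarity
        kept.append(tokens[t][sim < sim_thresh])    # keep only tokens that changed
    return np.concatenate(kept, axis=0)

def video_level_select(tokens, query, budget):
    """Video-level compression (sketch): retain the `budget` tokens most
    relevant to a query embedding, standing in for the task-guided
    discarding of irrelevant tokens during LLM processing."""
    scores = tokens @ query                         # (M,) relevance scores
    top = np.sort(np.argsort(scores)[-budget:])     # keep temporal order
    return tokens[top]
```

On a clip of four identical frames, `compress_clip` keeps only the first frame's tokens; combining both stages on long inputs is what yields the aggressive overall reduction the contribution claims.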

Contribution

LongVid dataset and short-to-long learning strategy

A large-scale training corpus containing 114,228 long videos with 3,444,849 question-answering pairs covering five task types, combined with a multi-stage training strategy that progresses from image and short video data to joint short and long video instruction tuning.
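The progressive scheme above can be sketched as a staged curriculum. The stage names, data mixes, and frame budgets below are illustrative assumptions, not the paper's actual recipe; the point is only that each stage inherits weights from the last while the visible context grows.

```python
# Hypothetical stage definitions for a short-to-long curriculum;
# values are assumptions for illustration, not the paper's settings.
STAGES = [
    {"name": "alignment",       "data": ["image"],                     "max_frames": 4},
    {"name": "short_video_sft", "data": ["image", "short_video"],      "max_frames": 64},
    {"name": "joint_long_sft",  "data": ["short_video", "long_video"], "max_frames": 512},
]

def run_curriculum(train_stage, stages=STAGES):
    """Run each stage in order; each stage initializes from the previous
    stage's weights, progressively extending the frame budget."""
    history = []
    for cfg in stages:
        train_stage(cfg)            # caller-supplied training routine
        history.append(cfg["name"])
    return history
```

The design choice this mirrors is curriculum ordering: cheap short-context stages establish alignment before expensive long-video instruction tuning is mixed in.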

Contribution

Multi-Hop Needle-In-A-Video-Haystack benchmark

A new evaluation benchmark that requires models to follow a reasoning path through multiple images inserted into long videos, with wrong paths as distractors. This tests both retrieval and complex reasoning abilities more robustly than previous single-hop approaches.
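Sample construction for such a benchmark can be sketched as below. The helper name `build_multihop_sample` and the needle labels are hypothetical; the sketch only captures the stated idea of scattering a correct reasoning chain plus wrong-path distractors across a long frame sequence.

```python
import random

def build_multihop_sample(num_frames, chain, distractors, seed=0):
    """Scatter a chain of needle images (the correct reasoning path) and
    distractor needles (wrong paths) at distinct frame positions in a
    long video. Returns each needle's frame index and the answer needle."""
    rng = random.Random(seed)
    needles = list(chain) + list(distractors)
    positions = rng.sample(range(num_frames), len(needles))  # no collisions
    placement = dict(zip(needles, positions))
    # A model must follow the chain hop by hop; the final hop is the answer.
    return placement, chain[-1]
```

Unlike single-hop NIAH, scoring here requires the model to locate each intermediate clue in order, so a lucky retrieval of the final needle alone does not certify the reasoning path.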
