ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
Overview
Overall Novelty Assessment
The paper introduces ScaleLong, a benchmark designed to evaluate multimodal large language models on long videos through a 'within-content' multi-timescale questioning approach. It resides in the Multi-Timescale Evaluation Benchmarks leaf, which contains only two papers including this one. This sparse population suggests that systematic evaluation frameworks explicitly disentangling temporal scales within identical video content remain relatively underexplored. The benchmark features 269 videos averaging 86 minutes, with questions targeting four hierarchical timescales (clip, shot, event, story) embedded in the same content, enabling direct performance comparison across scales.
The taxonomy reveals that most research effort concentrates on architectural solutions—Hierarchical Representation and Memory Architectures, Multi-Scale Temporal Modeling Architectures, and Temporal Reasoning approaches collectively account for over half the surveyed papers. The sibling paper in this leaf (H2VU Benchmark) also addresses hierarchical video understanding but differs in design philosophy. Neighboring leaves include Long-Form Video QA and Captioning Datasets (three papers) and Domain-Specific Long Video Benchmarks (one paper), indicating that general-purpose multi-timescale evaluation remains less developed than domain-specific or single-scale benchmarks. The taxonomy's scope note explicitly distinguishes this leaf by requiring 'explicit multi-timescale evaluation design,' separating it from general long-video datasets.
Among the 30 candidates examined, none clearly refutes the three core contributions. For the within-content multi-timescale design, 10 candidates were examined and none constituted a refuting match, suggesting this specific evaluation methodology is novel within the limited search scope. The U-shaped performance trend likewise showed no refutations across 10 candidates, though this empirical observation depends on which models were tested. The insights on visual token allocation similarly faced no direct prior work among the 10 candidates examined. These statistics indicate that, within the top-30 semantically similar papers, the benchmark's design choices and empirical findings appear distinctive, though an exhaustive literature review might reveal additional related work.
The analysis covers a focused sample of 30 papers from semantic search, not a comprehensive survey of all video understanding benchmarks. The sparse population of the Multi-Timescale Evaluation Benchmarks leaf and absence of refuting candidates suggest the work occupies a relatively unexplored niche. However, the limited search scope means potentially relevant benchmarks from adjacent communities (e.g., video summarization, temporal action localization) may not have been fully examined. The contribution appears novel within the surveyed context, though broader validation would strengthen confidence.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce ScaleLong, a benchmark specifically designed to assess MLLMs' multi-timescale capabilities in long videos. Its key innovation is embedding questions at four hierarchical temporal scales (Clip, Shot, Event, Story) within each individual video, enabling direct comparison of model performance across timescales on identical content. The benchmark comprises 269 long videos averaging 86 minutes, spanning 5 main categories and 36 subcategories, with 4-8 questions per video ensuring at least one question per timescale.
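To make the within-content design concrete, the sketch below shows one way a per-video question set could be represented and validated against the at-least-one-question-per-timescale constraint. The schema and field names are illustrative assumptions, not the authors' released data format.

```python
from dataclasses import dataclass
from typing import List

TIMESCALES = ("clip", "shot", "event", "story")  # the four hierarchical scales

@dataclass
class Question:
    timescale: str       # one of TIMESCALES (hypothetical field)
    text: str            # question stem
    options: List[str]   # multiple-choice options
    answer_index: int    # index of the correct option

@dataclass
class VideoEntry:
    video_id: str
    duration_minutes: float
    questions: List[Question]

def is_valid_entry(entry: VideoEntry) -> bool:
    """Within-content constraint: 4-8 questions on this single video,
    with every timescale probed at least once."""
    present = {q.timescale for q in entry.questions}
    return 4 <= len(entry.questions) <= 8 and all(ts in present for ts in TIMESCALES)
```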
Through comprehensive evaluation of 23 MLLMs on ScaleLong, the authors discover a consistent U-shaped performance pattern where models exhibit stronger comprehension at the shortest (Clip) and longest (Story) temporal scales but show noticeably reduced performance at intermediate scales (Shot and Event). This finding provides critical insights into how current MLLMs process information at different temporal granularities in long videos.
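As an illustration of how such a trend could be detected, the following sketch aggregates per-question results into per-timescale accuracy and checks whether both extremes outperform both intermediate scales. It assumes a simple (timescale, is_correct) result format and is not the paper's evaluation code.

```python
from collections import defaultdict

TIMESCALE_ORDER = ["clip", "shot", "event", "story"]  # shortest to longest

def accuracy_by_timescale(results):
    """results: iterable of (timescale, is_correct) pairs for one model."""
    correct, total = defaultdict(int), defaultdict(int)
    for timescale, is_correct in results:
        total[timescale] += 1
        correct[timescale] += int(is_correct)
    return {ts: correct[ts] / total[ts] for ts in TIMESCALE_ORDER if total[ts]}

def is_u_shaped(acc):
    """U-shape in the reported sense: the shortest (clip) and longest (story)
    scales both beat the two intermediate scales (shot, event)."""
    return min(acc["clip"], acc["story"]) > max(acc["shot"], acc["event"])
```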
The authors conduct ablation studies demonstrating that strategically increasing visual token allocation consistently improves MLLM performance across all evaluated timescales. Their analysis reveals that the optimal allocation of visual tokens between frame count and resolution depends on the target timescale, providing valuable guidance for future model development in long-video understanding.
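The frame-count versus resolution trade-off follows from a fixed visual-token budget: more tokens per frame (higher resolution) means fewer frames, and vice versa. The back-of-the-envelope sketch below illustrates this with an assumed ViT-style patch size, token-merging factor, and budget; none of these values come from the paper.

```python
def tokens_per_frame(height, width, patch=14, merge=2):
    """Approximate visual tokens per frame for a ViT-style encoder with
    patch size `patch` and `merge`x`merge` token merging (assumed values)."""
    return (height // (patch * merge)) * (width // (patch * merge))

def max_frames(budget, height, width, patch=14, merge=2):
    """Frames that fit in a fixed visual-token budget at a given resolution."""
    return budget // tokens_per_frame(height, width, patch, merge)

# With a hypothetical 16,384-token budget, lowering resolution buys temporal coverage:
budget = 16384
for h, w in [(448, 448), (336, 336), (224, 224)]:
    print(f"{h}x{w}: {tokens_per_frame(h, w)} tokens/frame, "
          f"{max_frames(budget, h, w)} frames")
```

Which end of this trade-off is optimal for a given timescale is precisely the question the authors' ablation addresses.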
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[14] H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding
Contribution Analysis
Detailed comparisons for each claimed contribution
ScaleLong benchmark with within-content multi-timescale design
The authors introduce ScaleLong, a benchmark specifically designed to assess MLLMs' multi-timescale capabilities in long videos. Its key innovation is embedding questions at four hierarchical temporal scales (Clip, Shot, Event, Story) within each individual video, enabling direct comparison of model performance across timescales on identical content. The benchmark comprises 269 long videos averaging 86 minutes, spanning 5 main categories and 36 subcategories, with 4-8 questions per video ensuring at least one question per timescale.
[3] Temporal Preference Optimization for Long-Form Video Understanding
[51] MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
[52] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
[53] MLVU: Benchmarking Multi-task Long Video Understanding
[54] EgoSchema: A Diagnostic Benchmark for Very Long-Form Video Language Understanding
[55] ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
[56] TemporalBench: Benchmarking Fine-Grained Temporal Understanding for Multimodal Video Models
[57] InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
[58] SlowFocus: Enhancing Fine-Grained Temporal Understanding in Video LLM
[59] MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
U-shaped performance trend across temporal scales
Through comprehensive evaluation of 23 MLLMs on ScaleLong, the authors discover a consistent U-shaped performance pattern where models exhibit stronger comprehension at the shortest (Clip) and longest (Story) temporal scales but show noticeably reduced performance at intermediate scales (Shot and Event). This finding provides critical insights into how current MLLMs process information at different temporal granularities in long videos.
[60] Lumiere: A Space-Time Diffusion Model for Video Generation
[61] UniVTG: Towards Unified Video-Language Temporal Grounding
[62] DroFormer: Temporal Action Detection with Drop Mechanism of Attention
[63] ViT and RNN for Temporal and Spatial Analysis in Video Sequences
[64] Do Language Models Understand Time?
[65] Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
[66] VMC: Video Motion Customization Using Temporal Attention Adaption for Text-to-Video Diffusion Models
[67] FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding
[68] Exploiting Multimodal Spatial-Temporal Patterns for Video Object Tracking
[69] Point Spatio-Temporal Pyramid Network for Point Cloud Video Understanding
Insights on visual token allocation for multi-timescale understanding
The authors conduct ablation studies demonstrating that strategically increasing visual token allocation consistently improves MLLM performance across all evaluated timescales. Their analysis reveals that the optimal allocation of visual tokens between frame count and resolution depends on the target timescale, providing valuable guidance for future model development in long-video understanding.