IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
Overview
Overall Novelty Assessment
The paper introduces IV-Bench, a benchmark for evaluating image-grounded video perception and reasoning across 966 videos and 2,560 queries spanning 13 tasks. Within the taxonomy, it resides in the 'Benchmarks and Evaluation Frameworks' leaf, which contains only two papers in total: this one and the MME-CoF Benchmark [47]. This is a notably sparse research direction compared to more crowded areas such as Temporal Grounding (6 papers) or Chain-of-Thought Video Reasoning (4 papers), suggesting that comprehensive benchmark construction for image-grounded video understanding remains an underexplored niche.
The taxonomy reveals that IV-Bench sits adjacent to several active research branches. Video Question Answering and Grounding (15 papers across 5 sub-areas) addresses temporal localization and causal reasoning, while Spatial-Temporal Grounding (6 papers) focuses on fine-grained object tracking. Multimodal Video Understanding Models (4 papers) develop unified architectures for perception and generation. IV-Bench's emphasis on image-grounded perception bridges these areas by requiring models to integrate static visual context with temporal dynamics, a capability distinct from purely temporal grounding or general video understanding without explicit image anchoring.
Among the 29 candidates examined, none clearly refutes any of the three contributions: 10 candidates were checked against Contribution A (the benchmark itself), 10 against Contribution B (the evaluation of 28 MLLMs), and 9 against Contribution C (the ablation studies), with no refuting overlap found in any case. This suggests that, within the limited search scope, no prior work provides a directly comparable benchmark combining image-grounded queries, multi-task coverage, and systematic MLLM evaluation. However, the modest search scale (29 papers) means the analysis captures the top semantic matches rather than exhaustive prior art.
Based on the limited literature search, IV-Bench appears to occupy a relatively novel position within a sparse benchmark subfield. The taxonomy structure confirms that comprehensive evaluation frameworks for image-grounded video understanding are underrepresented compared to method-focused branches. The search examined 29 candidates without finding a clear refutation, but this reflects the scope of top-K semantic retrieval rather than supporting a definitive novelty claim across all video understanding literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present IV-Bench, a novel benchmark comprising 966 videos paired with 2,560 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning) spanning 5 distinct categories. The benchmark uniquely uses externally sourced reference images rather than video frames to provide visual context for queries.
The authors conduct extensive evaluations on 28 MLLMs, including both open-source and closed-source models, revealing a substantial performance gap: the best model achieves only 28.9% accuracy, compared to 88.8% human performance, which demonstrates a significant research gap in this capability.
The authors provide ablation studies demonstrating that model scale critically affects the ability to utilize visual context, with larger models benefiting significantly from image contexts while smaller models show minimal improvements. They also analyze the impact of visual token allocation and optimal placement of image queries relative to video frames.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[47] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Contribution Analysis
Detailed comparisons for each claimed contribution
IV-Bench benchmark for image-grounded video perception and reasoning
The authors present IV-Bench, a novel benchmark comprising 966 videos paired with 2,560 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning) spanning 5 distinct categories. The benchmark uniquely uses externally sourced reference images rather than video frames to provide visual context for queries.
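To make the benchmark's composition concrete, the following is a minimal sketch of what a single IV-Bench query record might look like. The field names and the example instance are illustrative assumptions, not the authors' released schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one IV-Bench query; field names are
# assumptions for illustration, not the authors' released data format.
@dataclass
class IVBenchQuery:
    video_path: str        # one of the 966 source videos
    reference_image: str   # externally sourced image, not a video frame
    question: str          # image-text query (2,560 in total)
    options: List[str]     # candidate answers, if multiple-choice
    answer: str            # ground-truth answer
    task: str              # one of 13 tasks (7 perception, 6 reasoning)
    category: str          # one of 5 higher-level categories

# Purely illustrative example instance:
example = IVBenchQuery(
    video_path="videos/0001.mp4",
    reference_image="images/0001_ref.jpg",
    question="Does the person shown in the reference image appear in the video?",
    options=["Yes", "No"],
    answer="Yes",
    task="image-grounded perception",
    category="perception",
)
```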
[35] VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
[61] Glamm: Pixel grounding large multimodal model
[62] Groundinggpt: Language enhanced multi-modal grounding model
[63] Perception Test: A Diagnostic Benchmark for Multimodal Video Models
[64] Videoglamm: A large multimodal model for pixel-level visual grounding in videos
[65] Ok-vqa: A visual question answering benchmark requiring external knowledge
[66] OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding
[67] Vidtext: Towards comprehensive evaluation for video text understanding
[68] Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering
[69] On Pursuit of Designing Multi-modal Transformer for Video Grounding
Comprehensive evaluation of 28 state-of-the-art MLLMs
The authors conduct extensive evaluations on 28 MLLMs, including both open-source and closed-source models, revealing a substantial performance gap: the best model achieves only 28.9% accuracy, compared to 88.8% human performance, which demonstrates a significant research gap in this capability.
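To illustrate the evaluation protocol, the sketch below shows one way per-task and overall accuracy could be aggregated over model predictions on a benchmark of this kind; the (task, prediction, ground_truth) input format is an assumption for illustration, not the paper's evaluation code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Aggregate per-task and overall accuracy from (task, prediction, ground_truth)
# tuples; exact-match scoring is an illustrative assumption.
def aggregate_accuracy(results: List[Tuple[str, str, str]]) -> Dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    for task, prediction, ground_truth in results:
        total[task] += 1
        correct[task] += int(prediction.strip().lower() == ground_truth.strip().lower())
    scores = {task: correct[task] / total[task] for task in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores

# Usage with two toy records:
print(aggregate_accuracy([
    ("object matching", "Yes", "Yes"),
    ("event reasoning", "B", "C"),
]))  # -> {'object matching': 1.0, 'event reasoning': 0.0, 'overall': 0.5}
```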
[51] Seed-bench: Benchmarking multimodal large language models
[52] Blink: Multimodal large language models can see but not perceive
[53] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
[54] SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
[55] Videollama 3: Frontier multimodal foundation models for image and video understanding
[56] Video understanding with large language models: A survey
[57] Token-Efficient Long Video Understanding for Multimodal LLMs
[58] Mvbench: A comprehensive multi-modal video understanding benchmark
[59] From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding
[60] Vidi: Large multimodal models for video understanding and editing
Ablation studies and insights on model design factors
The authors provide ablation studies demonstrating that model scale critically affects the ability to utilize visual context, with larger models benefiting significantly from image contexts while smaller models show minimal improvements. They also analyze the impact of visual token allocation and optimal placement of image queries relative to video frames.
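The placement analysis can be made concrete with a small sketch of how a reference image might be ordered before or after the sampled video frames when constructing a model's multimodal input. The generic message format and the function below are illustrative assumptions, not any specific model's API.

```python
from typing import List

# Build a generic multimodal input where the reference image is placed either
# before or after the sampled video frames; the dict-based message format is
# an assumption for illustration only.
def build_inputs(image: str, frames: List[str], question: str,
                 image_position: str = "before") -> List[dict]:
    image_part = [{"type": "image", "path": image}]
    frame_parts = [{"type": "video_frame", "path": f} for f in frames]
    visual = image_part + frame_parts if image_position == "before" else frame_parts + image_part
    return visual + [{"type": "text", "text": question}]

# Two variants of the same query, differing only in image placement:
frames = [f"frames/{i:04d}.jpg" for i in range(0, 32, 8)]
before = build_inputs("ref.jpg", frames, "Is the pictured object used in the video?", "before")
after = build_inputs("ref.jpg", frames, "Is the pictured object used in the video?", "after")
```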