IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
Overview
Overall Novelty Assessment
The paper introduces IV-Bench, a benchmark for evaluating image-grounded video perception and reasoning across 966 videos and 2,560 queries spanning 13 tasks. Within the taxonomy, it resides in the 'Benchmarks and Evaluation Frameworks' leaf, which contains only two papers in total: this one and the MME-CoF Benchmark [47]. This is a notably sparse research direction compared to more crowded areas such as Temporal Grounding (6 papers) or Chain-of-Thought Video Reasoning (4 papers), suggesting that comprehensive benchmark construction for image-grounded video understanding remains an underexplored niche.
The taxonomy reveals that IV-Bench sits adjacent to several active research branches. Video Question Answering and Grounding (15 papers across 5 sub-areas) addresses temporal localization and causal reasoning, while Spatial-Temporal Grounding (6 papers) focuses on fine-grained object tracking. Multimodal Video Understanding Models (4 papers) develop unified architectures for perception and generation. IV-Bench's emphasis on image-grounded perception bridges these areas by requiring models to integrate static visual context with temporal dynamics, a capability distinct from purely temporal grounding or general video understanding without explicit image anchoring.
Among the 29 candidates examined, none clearly refutes any of the three contributions: 10 candidates were checked against Contribution A (the benchmark itself), 10 against Contribution B (the evaluation of 28 MLLMs), and 9 against Contribution C (the ablation studies), with no refuting overlap found in any case. This suggests that, within the limited search scope, no prior work provides a directly comparable benchmark combining image-grounded queries, multi-task coverage, and systematic MLLM evaluation. However, the modest search scale (29 papers) means the analysis captures the top semantic matches rather than exhaustive prior art.
Based on the limited literature search, IV-Bench appears to occupy a relatively novel position within a sparse benchmark subfield. The taxonomy structure confirms that comprehensive evaluation frameworks for image-grounded video understanding are underrepresented compared to method-focused branches. The search examined 29 candidates without finding a clear refutation, but this reflects the scope of top-K semantic retrieval rather than supporting a definitive novelty claim across all video understanding literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present IV-Bench, a novel benchmark comprising 966 videos paired with 2,560 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning) spanning 5 distinct categories. The benchmark uniquely uses externally sourced reference images rather than video frames to provide visual context for queries.
The authors conduct extensive evaluations on 28 MLLMs, including both open-source and closed-source models, revealing a substantial performance gap: the best model achieves only 28.9% accuracy, compared to 88.8% human performance, which demonstrates a significant research gap in this capability.
The authors provide ablation studies demonstrating that model scale critically affects the ability to utilize visual context, with larger models benefiting significantly from image contexts while smaller models show minimal improvements. They also analyze the impact of visual token allocation and optimal placement of image queries relative to video frames.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[47] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Contribution Analysis
Detailed comparisons for each claimed contribution
IV-Bench benchmark for image-grounded video perception and reasoning
The authors present IV-Bench, a novel benchmark comprising 966 videos paired with 2,560 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning) spanning 5 distinct categories. The benchmark uniquely uses externally sourced reference images rather than video frames to provide visual context for queries.
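To make the benchmark's composition concrete, the following is a minimal sketch of what a single IV-Bench query record might look like. The field names and the example instance are illustrative assumptions, not the authors' released schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one IV-Bench query; field names are
# assumptions for illustration, not the authors' released data format.
@dataclass
class IVBenchQuery:
    video_path: str        # one of the 966 source videos
    reference_image: str   # externally sourced image, not a video frame
    question: str          # image-text query (2,560 in total)
    options: List[str]     # candidate answers, if multiple-choice
    answer: str            # ground-truth answer
    task: str              # one of 13 tasks (7 perception, 6 reasoning)
    category: str          # one of 5 higher-level categories

# Purely illustrative example instance:
example = IVBenchQuery(
    video_path="videos/0001.mp4",
    reference_image="images/0001_ref.jpg",
    question="Does the person shown in the reference image appear in the video?",
    options=["Yes", "No"],
    answer="Yes",
    task="image-grounded perception",
    category="perception",
)
```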
[35] VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
[61] Glamm: Pixel grounding large multimodal model
[62] Groundinggpt: Language enhanced multi-modal grounding model
[63] Perception Test: A Diagnostic Benchmark for Multimodal Video Models
[64] Videoglamm: A large multimodal model for pixel-level visual grounding in videos
[65] Ok-vqa: A visual question answering benchmark requiring external knowledge
[66] OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding
[67] Vidtext: Towards comprehensive evaluation for video text understanding
[68] Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering
[69] On Pursuit of Designing Multi-modal Transformer for Video Grounding
Comprehensive evaluation of 28 state-of-the-art MLLMs
The authors conduct extensive evaluations on 28 MLLMs, including both open-source and closed-source models, revealing a substantial performance gap: the best model achieves only 28.9% accuracy, compared to 88.8% human performance, which demonstrates a significant research gap in this capability.
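To illustrate the evaluation protocol, the sketch below shows one way per-task and overall accuracy could be aggregated over model predictions on a benchmark of this kind; the (task, prediction, ground_truth) input format is an assumption for illustration, not the paper's evaluation code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Aggregate per-task and overall accuracy from (task, prediction, ground_truth)
# tuples; exact-match scoring is an illustrative assumption.
def aggregate_accuracy(results: List[Tuple[str, str, str]]) -> Dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    for task, prediction, ground_truth in results:
        total[task] += 1
        correct[task] += int(prediction.strip().lower() == ground_truth.strip().lower())
    scores = {task: correct[task] / total[task] for task in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores

# Usage with two toy records:
print(aggregate_accuracy([
    ("object matching", "Yes", "Yes"),
    ("event reasoning", "B", "C"),
]))  # -> {'object matching': 1.0, 'event reasoning': 0.0, 'overall': 0.5}
```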
[51] Seed-bench: Benchmarking multimodal large language models
[52] Blink: Multimodal large language models can see but not perceive
[53] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
[54] SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
[55] Videollama 3: Frontier multimodal foundation models for image and video understanding
[56] Video understanding with large language models: A survey
[57] Token-Efficient Long Video Understanding for Multimodal LLMs
[58] Mvbench: A comprehensive multi-modal video understanding benchmark
[59] From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding
[60] Vidi: Large multimodal models for video understanding and editing
Ablation studies and insights on model design factors
The authors provide ablation studies demonstrating that model scale critically affects the ability to utilize visual context, with larger models benefiting significantly from image contexts while smaller models show minimal improvements. They also analyze the impact of visual token allocation and optimal placement of image queries relative to video frames.
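The placement analysis can be made concrete with a small sketch of how a reference image might be ordered before or after the sampled video frames when constructing a model's multimodal input. The generic message format and the function below are illustrative assumptions, not any specific model's API.

```python
from typing import List

# Build a generic multimodal input where the reference image is placed either
# before or after the sampled video frames; the dict-based message format is
# an assumption for illustration only.
def build_inputs(image: str, frames: List[str], question: str,
                 image_position: str = "before") -> List[dict]:
    image_part = [{"type": "image", "path": image}]
    frame_parts = [{"type": "video_frame", "path": f} for f in frames]
    visual = image_part + frame_parts if image_position == "before" else frame_parts + image_part
    return visual + [{"type": "text", "text": question}]

# Two variants of the same query, differing only in image placement:
frames = [f"frames/{i:04d}.jpg" for i in range(0, 32, 8)]
before = build_inputs("ref.jpg", frames, "Is the pictured object used in the video?", "before")
after = build_inputs("ref.jpg", frames, "Is the pictured object used in the video?", "after")
```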