IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Image-Grounded Video Perception and Reasoning, Multimodal LLMs, Benchmark
Abstract:

Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose \textbf{IV-Bench}, the first comprehensive benchmark for evaluating \emph{Image-Grounded Video Perception and Reasoning}. IV-Bench consists of 966 videos paired with 2,560 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash, and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform on image-grounded video perception and reasoning, achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. These findings collectively provide valuable insights for future research. Our code and data are released at \url{https://anonymous.4open.science/r/IV-Bench-A3F7}.
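
As a rough illustration of two of the factors named above, frame number and resolution, the following is a minimal sketch (not the authors' pipeline) of uniformly sampling a fixed number of frames from a video and resizing them before they are passed to an MLLM. The function name and default values are assumptions made for illustration only.

```python
# Minimal sketch, not the IV-Bench evaluation code: uniformly sample `num_frames`
# frames from a video and resize each to `resolution` x `resolution` pixels.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 32, resolution: int = 448):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices over the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        # Square resize for simplicity; real pipelines may preserve aspect ratio.
        frames.append(cv2.resize(frame, (resolution, resolution)))
    cap.release()
    return frames
```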

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces IV-Bench, a benchmark for evaluating image-grounded video perception and reasoning across 966 videos and 2,560 queries spanning 13 tasks. Within the taxonomy, it resides in the 'Benchmarks and Evaluation Frameworks' leaf, which contains only two papers: this one and the MME-CoF Benchmark. This is a notably sparse research direction compared to more crowded areas such as Temporal Grounding (6 papers) or Chain-of-Thought Video Reasoning (4 papers), suggesting that comprehensive benchmark construction for image-grounded video understanding remains an underexplored niche.

The taxonomy reveals that IV-Bench sits adjacent to several active research branches. Video Question Answering and Grounding (15 papers across 5 sub-areas) addresses temporal localization and causal reasoning, while Spatial-Temporal Grounding (6 papers) focuses on fine-grained object tracking. Multimodal Video Understanding Models (4 papers) develop unified architectures for perception and generation. IV-Bench's emphasis on image-grounded perception bridges these areas by requiring models to integrate static visual context with temporal dynamics, a capability distinct from purely temporal grounding or general video understanding without explicit image anchoring.

Among the 29 candidates examined, none clearly refutes any of the three contributions. For Contribution A (the benchmark itself), 10 candidates were examined with no refutable overlaps; for Contribution B (the evaluation of 28 MLLMs), 10 candidates with no refutations; and for Contribution C (the ablation studies), 9 candidates with no refutations. This suggests that, within the limited search scope, no prior work provides a directly comparable benchmark combining image-grounded queries, multi-task coverage, and systematic MLLM evaluation. However, the modest search scale (29 papers) means the analysis captures top semantic matches rather than exhaustive prior art.

Based on the limited literature search, IV-Bench appears to occupy a relatively novel position within a sparse benchmark subfield. The taxonomy structure confirms that comprehensive evaluation frameworks for image-grounded video understanding are underrepresented compared to method-focused branches. While the search examined 29 candidates without finding clear refutations, this reflects the scope of top-K semantic retrieval rather than a definitive novelty claim across all video understanding literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: image-grounded video perception and reasoning. The field has evolved into a rich ecosystem organized around seven major branches. Video Question Answering and Grounding focuses on extracting answers and localizing relevant moments from video content, often requiring temporal alignment between queries and visual evidence. Video Reasoning Frameworks and Architectures develop systematic approaches for multi-step inference, incorporating chain-of-thought mechanisms and structured reasoning pipelines such as Video-of-thought[1] and VTimeCot[5]. Spatial-Temporal Grounding and Segmentation addresses precise localization of objects and events across both space and time, while Multimodal Video Understanding Models build unified architectures that integrate visual, textual, and sometimes audio signals. Specialized Video Understanding Tasks target domain-specific challenges like scientific reasoning in SciVideoBench[14] or mathematical problem-solving in VideoMathQA[6]. Benchmarks and Evaluation Frameworks provide standardized testbeds for assessing model capabilities, and Video Generation and Editing explores synthesis and manipulation of video content.

Recent work reveals a tension between end-to-end learning and structured reasoning. Many studies in the reasoning branch emphasize explicit intermediate steps and temporal decomposition, as seen in Video-R1[17] and FrameThinker[43], while others pursue tighter integration of perception and reasoning within unified architectures.

IV-Bench[0] sits squarely within the Benchmarks and Evaluation Frameworks branch, alongside MME-CoF Benchmark[47], both aiming to rigorously test video understanding systems. Compared to Star Benchmark[3], which focuses on spatial-temporal reasoning evaluation, or CG-Bench[22], which emphasizes compositional generalization, IV-Bench[0] appears to target comprehensive assessment of image-grounded video perception capabilities. This positioning reflects a broader trend toward developing evaluation protocols that can capture the nuanced interplay between visual grounding, temporal reasoning, and question answering that defines modern video understanding systems.

Claimed Contributions

IV-Bench benchmark for image-grounded video perception and reasoning

The authors present IV-Bench, a novel benchmark comprising 966 videos paired with 2,560 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) spanning 5 distinct categories. The benchmark uniquely uses externally sourced reference images rather than video frames to provide visual context for queries.

10 retrieved papers
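
To make the composition described above concrete, below is a minimal sketch of what a single image-grounded query record could look like. The schema is an assumption for illustration; the field names, paths, task label, and category are hypothetical and do not come from the released IV-Bench data.

```python
# Hypothetical record layout, for illustration only; not IV-Bench's actual schema.
from dataclasses import dataclass

@dataclass
class IVBenchQuery:
    video_id: str          # one of the 966 source videos
    reference_image: str   # externally sourced image, not a frame from the video
    question: str          # text query grounded in the reference image
    options: list[str]     # candidate answers
    answer: str            # gold answer
    task: str              # one of the 13 tasks (7 perception, 6 reasoning)
    category: str          # one of the 5 categories

example = IVBenchQuery(
    video_id="vid_0001",                      # hypothetical identifier
    reference_image="images/query_0001.jpg",  # hypothetical path
    question="When does the object shown in the reference image first appear in the video?",
    options=["A. 0:12", "B. 1:05", "C. 2:40", "D. It never appears"],
    answer="B",
    task="perception",                        # placeholder task label
    category="documentary",                   # placeholder category label
)
```
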
Comprehensive evaluation of 28 state-of-the-art MLLMs

The authors conduct extensive evaluations of 28 MLLMs, including both open-source and closed-source models. The results reveal a substantial performance gap: the best model achieves only 28.9% accuracy, compared to 88.8% human performance, demonstrating that this capability remains a significant research gap.

10 retrieved papers
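
The headline numbers here reduce to a simple per-query accuracy. The sketch below shows one way such a score could be computed from letter-choice predictions; it is an assumption about the scoring format, not the authors' released evaluation script.

```python
# Hedged sketch of accuracy scoring over multiple-choice predictions; the real
# IV-Bench evaluation may normalize free-form model outputs differently.
def accuracy(predictions: list[str], golds: list[str]) -> float:
    assert len(predictions) == len(golds) and golds, "need aligned, non-empty lists"
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, golds))
    return correct / len(golds)

# Roughly 740 correct answers out of 2,560 queries corresponds to the ~28.9%
# reported for the best model, versus roughly 2,273 correct for the 88.8% human score.
print(f"{740 / 2560:.1%}")   # -> 28.9%
print(f"{2273 / 2560:.1%}")  # -> 88.8%
```
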
Ablation studies and insights on model design factors

The authors provide ablation studies demonstrating that model scale critically affects the ability to utilize visual context, with larger models benefiting significantly from image contexts while smaller models show minimal improvements. They also analyze the impact of visual token allocation and optimal placement of image queries relative to video frames.

9 retrieved papers
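
The placement finding can be pictured with a short sketch of how an interleaved prompt might be assembled, with the reference image placed either before or after the sampled video frames. The message structure below mirrors common chat-style MLLM inputs but is a hypothetical illustration, not the authors' implementation or any specific model's API.

```python
# Hypothetical prompt assembly contrasting the two placements studied in the
# ablation: the reference image either precedes or follows the video frames.
def build_user_message(question: str, image_path: str, frame_paths: list[str],
                       image_first: bool = True) -> dict:
    frame_items = [{"type": "image", "path": p} for p in frame_paths]
    image_item = [{"type": "image", "path": image_path}]
    content = (image_item + frame_items) if image_first else (frame_items + image_item)
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}

msg = build_user_message(
    "Which scene in the video best matches the reference image?",
    "images/query_0001.jpg",
    [f"frames/vid_0001_{i:03d}.jpg" for i in range(8)],
    image_first=True,  # flip to False to test the alternative ordering
)
```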

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: IV-Bench benchmark for image-grounded video perception and reasoning

Contribution B: Comprehensive evaluation of 28 state-of-the-art MLLMs

Contribution C: Ablation studies and insights on model design factors