ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Large Language Model, Video Large Language Model, Natural Science, Benchmark
Abstract:

Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ExpVid, a benchmark for evaluating multimodal large language models on scientific experiment videos through a three-level task hierarchy spanning fine-grained perception, procedural understanding, and scientific reasoning. It resides in the 'Scientific Experiment Video Understanding' leaf, which contains only three papers total including this work. This represents a sparse, emerging research direction within the broader taxonomy, suggesting the paper addresses a relatively underexplored niche compared to crowded areas like general video understanding benchmarks.

The taxonomy reveals that ExpVid's immediate neighbors include SciVideoBench and Scientific Activity Recognition, both targeting laboratory content but with different emphases. Adjacent leaves cover knowledge-intensive technical video analysis and specialized domain applications such as materials science and egocentric instructional interactions. The scope note for this leaf explicitly focuses on experimental procedures and laboratory protocols, distinguishing it from broader scientific content evaluation. This positioning indicates the paper carves out a procedural-centric angle within scientific video understanding, diverging from general scientific question answering or technical terminology recognition found in neighboring branches.

Of the thirty candidate papers examined in total, the benchmark contribution was compared against ten, one of which was judged refutable, suggesting that some prior work in scientific experiment benchmarking exists but is limited. The vision-centric annotation pipeline and the comprehensive MLLM evaluation were each compared against ten candidates with zero refutations, indicating that these methodological and empirical aspects appear more novel within the search scope. These statistics reflect a modest literature search scale: the analysis captures the top semantic matches rather than exhaustive coverage of all potentially related work in scientific video understanding or annotation methodology.

Based on the limited search scope of thirty candidates, the work appears to occupy a relatively novel position in a sparse research area, particularly in its procedural focus and annotation approach. However, the presence of at least one overlapping benchmark suggests the core idea of scientific experiment video evaluation is not entirely unprecedented. The analysis does not cover exhaustive citation networks or domain-specific venues that might reveal additional related efforts.

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: multimodal large language model evaluation on scientific experiment videos.

The field of multimodal large language model evaluation has grown into a diverse landscape with several major branches. General Video Understanding Benchmarks such as Video-MME[1], MVBench[2], and SEED-Bench[3] establish broad capabilities across everyday video content, while Scientific and Specialized Domain Evaluation targets niche areas like scientific experiments, materials science, and domain-specific reasoning. Multimodal LLM Architectures and Training explores model design choices, including works like Macaw-LLM[4] and Valley[6], whereas Practical Applications and Task-Specific Systems addresses real-world deployment scenarios ranging from interactive search to robotic planning. Trustworthiness and Robustness Evaluation examines model reliability through hallucination detection and safety assessments, and Egocentric and First-Person Video Understanding focuses on perspective-specific challenges in videos captured from a wearer's viewpoint.

Within the Scientific and Specialized Domain Evaluation branch, a small but growing cluster of works addresses the unique challenges of understanding procedural and experimental content. SciVideoBench[9] and Scientific Activity Recognition[21] explore recognition and reasoning in laboratory settings, while Scientific Multimodal Summarization[5] tackles condensing complex scientific narratives.

ExpVid[0] situates itself squarely in this scientific experiment video understanding cluster, emphasizing fine-grained procedural comprehension and temporal reasoning that general benchmarks like Video-MME[1] or MVBench[2] do not fully capture. Compared to SciVideoBench[9], which focuses on broader scientific video question answering, ExpVid[0] appears to drill deeper into experiment-specific phenomena such as causal relationships and procedural steps. This specialized focus reflects an emerging recognition that domain expertise and procedural nuance require tailored evaluation beyond what general-purpose video benchmarks provide.

Claimed Contributions

ExpVid benchmark for scientific experiment video understanding

The authors introduce ExpVid, a novel benchmark designed to evaluate multimodal large language models on authentic scientific experiment videos. The benchmark features a three-level task hierarchy spanning fine-grained perception of tools and materials, procedural understanding of experimental steps, and scientific reasoning connecting procedures to published conclusions.

10 retrieved papers (1 can refute)
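
To make the three-level hierarchy concrete, here is a minimal sketch of one plausible in-memory representation of such a benchmark structure. The class and field names are hypothetical illustrations, not ExpVid's actual schema; only the three level names are taken from the paper.

```python
# Illustrative sketch only: one plausible representation of a three-level
# task hierarchy. Class and field names are hypothetical assumptions;
# only the three level names come from the paper.
from dataclasses import dataclass, field

@dataclass
class Task:
    question: str
    options: list[str] = field(default_factory=list)
    answer: str = ""

@dataclass
class Level:
    name: str
    tasks: list[Task] = field(default_factory=list)

# Concrete per-level tasks are left empty as placeholders.
EXPVID_HIERARCHY = [
    Level("Fine-grained Perception"),   # tools, materials, actions
    Level("Procedural Understanding"),  # step order and completeness
    Level("Scientific Reasoning"),      # linking procedure to conclusions
]
```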
Vision-centric annotation pipeline with expert validation

The authors develop a semi-automatic annotation method that combines automated question generation from videos and transcripts with multi-disciplinary PhD-level expert verification. The pipeline ensures tasks require visual grounding rather than relying solely on textual cues or background knowledge.

10 retrieved papers (none refutable)
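
As a rough illustration of how such a pipeline could be organized (not the authors' implementation), the sketch below wires together the three stages the report describes: automated candidate generation, a text-only filter that enforces visual grounding, and expert verification. Every function is a hypothetical placeholder.

```python
# Hedged sketch of a vision-centric annotation pipeline of the kind the
# report describes. Every function here is a hypothetical placeholder
# standing in for a pipeline stage, not ExpVid's actual implementation.

def generate_candidates(video, transcript):
    """Placeholder: draft question-answer pairs from frames and narration,
    e.g. by prompting an MLLM with sampled frames plus the transcript."""
    return []

def answerable_from_text(qa, transcript):
    """Placeholder 'blind test': True if a text-only model can answer the
    question from the transcript alone, i.e. no visual grounding needed."""
    return False

def expert_approves(qa):
    """Placeholder: multi-disciplinary PhD-level expert verification."""
    return True

def build_annotations(video, transcript):
    accepted = []
    for qa in generate_candidates(video, transcript):
        # Discard questions answerable without watching the video, so the
        # surviving tasks genuinely require visual grounding.
        if answerable_from_text(qa, transcript):
            continue
        # Keep only items that domain experts verify as correct.
        if expert_approves(qa):
            accepted.append(qa)
    return accepted
```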
Comprehensive evaluation revealing MLLM limitations in experimental settings

The authors evaluate 19 state-of-the-art multimodal models on ExpVid, revealing that while models excel at coarse recognition, they struggle with fine-grained visual disambiguation, temporal state tracking, and connecting experimental procedures to scientific conclusions. The evaluation also exposes a notable gap between proprietary and open-source models, particularly in higher-order reasoning.

10 retrieved papers (none refutable)
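
A simple way to surface the reported pattern (strong coarse recognition, weak higher-order reasoning) is to break accuracy out by hierarchy level. The sketch below assumes a flat list of result records with "level", "answer", and "prediction" fields; this format is an assumption, not the benchmark's release schema.

```python
# Minimal sketch of how the reported failure pattern could be quantified:
# accuracy broken out by hierarchy level. The record format ("level",
# "answer", "prediction") is an assumption, not the benchmark's schema.
from collections import defaultdict

def accuracy_by_level(records):
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["level"]] += 1
        correct[r["level"]] += r["prediction"] == r["answer"]
    return {lvl: correct[lvl] / total[lvl] for lvl in total}

# Toy illustration: a model that nails perception but misses reasoning.
records = [
    {"level": "Fine-grained Perception", "answer": "A", "prediction": "A"},
    {"level": "Procedural Understanding", "answer": "B", "prediction": "B"},
    {"level": "Scientific Reasoning", "answer": "C", "prediction": "D"},
]
print(accuracy_by_level(records))
# -> {'Fine-grained Perception': 1.0, 'Procedural Understanding': 1.0,
#     'Scientific Reasoning': 0.0}
```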
