ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Large Language Model, Video Large Language Model, Natural Science, Benchmark
Abstract:

Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ExpVid, a benchmark for evaluating multimodal large language models on scientific experiment videos through a three-level task hierarchy spanning fine-grained perception, procedural understanding, and scientific reasoning. It resides in the 'Scientific Experiment Video Understanding' leaf, which contains only three papers total including this work. This represents a sparse, emerging research direction within the broader taxonomy, suggesting the paper addresses a relatively underexplored niche compared to crowded areas like general video understanding benchmarks.

The taxonomy reveals that ExpVid's immediate neighbors include SciVideoBench and Scientific Activity Recognition, both targeting laboratory content but with different emphases. Adjacent leaves cover knowledge-intensive technical video analysis and specialized domain applications such as materials science and egocentric instructional interactions. The scope note for this leaf explicitly focuses on experimental procedures and laboratory protocols, distinguishing it from broader scientific content evaluation. This positioning indicates the paper carves out a procedural-centric angle within scientific video understanding, diverging from general scientific question answering or technical terminology recognition found in neighboring branches.

Of the thirty candidate papers examined in total, the benchmark contribution was compared against ten, one of which was judged refutable, suggesting that some prior work in scientific experiment benchmarking exists but is limited. The vision-centric annotation pipeline and the comprehensive MLLM evaluation were each compared against ten candidates with zero refutations, indicating that these methodological and empirical aspects appear more novel within the search scope. These statistics reflect a modest literature search scale: the analysis captures the top semantic matches rather than exhaustive coverage of all potentially related work in scientific video understanding or annotation methodology.

Based on the limited search scope of thirty candidates, the work appears to occupy a relatively novel position in a sparse research area, particularly in its procedural focus and annotation approach. However, the presence of at least one overlapping benchmark suggests the core idea of scientific experiment video evaluation is not entirely unprecedented. The analysis does not cover exhaustive citation networks or domain-specific venues that might reveal additional related efforts.

Taxonomy

Core-task Taxonomy Papers: 41
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: multimodal large language model evaluation on scientific experiment videos.

The field of multimodal large language model evaluation has grown into a diverse landscape with several major branches. General Video Understanding Benchmarks such as Video-MME[1], MVBench[2], and SEED-Bench[3] establish broad capabilities across everyday video content, while Scientific and Specialized Domain Evaluation targets niche areas like scientific experiments, materials science, and domain-specific reasoning. Multimodal LLM Architectures and Training explores model design choices, including works like Macaw-LLM[4] and Valley[6], whereas Practical Applications and Task-Specific Systems addresses real-world deployment scenarios ranging from interactive search to robotic planning. Trustworthiness and Robustness Evaluation examines model reliability through hallucination detection and safety assessments, and Egocentric and First-Person Video Understanding focuses on perspective-specific challenges in videos captured from a wearer's viewpoint.

Within the Scientific and Specialized Domain Evaluation branch, a small but growing cluster of works addresses the unique challenges of understanding procedural and experimental content. SciVideoBench[9] and Scientific Activity Recognition[21] explore recognition and reasoning in laboratory settings, while Scientific Multimodal Summarization[5] tackles condensing complex scientific narratives.

ExpVid[0] situates itself squarely in this scientific experiment video understanding cluster, emphasizing fine-grained procedural comprehension and temporal reasoning that general benchmarks like Video-MME[1] or MVBench[2] do not fully capture. Compared to SciVideoBench[9], which focuses on broader scientific video question answering, ExpVid[0] appears to drill deeper into experiment-specific phenomena such as causal relationships and procedural steps. This specialized focus reflects an emerging recognition that domain expertise and procedural nuance require tailored evaluation beyond what general-purpose video benchmarks provide.

Claimed Contributions

ExpVid benchmark for scientific experiment video understanding

The authors introduce ExpVid, a novel benchmark designed to evaluate multimodal large language models on authentic scientific experiment videos. The benchmark features a three-level task hierarchy spanning fine-grained perception of tools and materials, procedural understanding of experimental steps, and scientific reasoning connecting procedures to published conclusions.

10 retrieved papers (1 can refute)
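
To make the three-level hierarchy concrete, here is a minimal sketch of one plausible in-memory representation of such a benchmark structure. The class and field names are hypothetical illustrations, not ExpVid's actual schema; only the three level names are taken from the paper.

```python
# Illustrative sketch only: one plausible representation of a three-level
# task hierarchy. Class and field names are hypothetical assumptions;
# only the three level names come from the paper.
from dataclasses import dataclass, field

@dataclass
class Task:
    question: str
    options: list[str] = field(default_factory=list)
    answer: str = ""

@dataclass
class Level:
    name: str
    tasks: list[Task] = field(default_factory=list)

# Concrete per-level tasks are left empty as placeholders.
EXPVID_HIERARCHY = [
    Level("Fine-grained Perception"),   # tools, materials, actions
    Level("Procedural Understanding"),  # step order and completeness
    Level("Scientific Reasoning"),      # linking procedure to conclusions
]
```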
Vision-centric annotation pipeline with expert validation

The authors develop a semi-automatic annotation method that combines automated question generation from videos and transcripts with multi-disciplinary PhD-level expert verification. The pipeline ensures tasks require visual grounding rather than relying solely on textual cues or background knowledge.

10 retrieved papers (none refutable)
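
As a rough illustration of how such a pipeline could be organized (not the authors' implementation), the sketch below wires together the three stages the report describes: automated candidate generation, a text-only filter that enforces visual grounding, and expert verification. Every function is a hypothetical placeholder.

```python
# Hedged sketch of a vision-centric annotation pipeline of the kind the
# report describes. Every function here is a hypothetical placeholder
# standing in for a pipeline stage, not ExpVid's actual implementation.

def generate_candidates(video, transcript):
    """Placeholder: draft question-answer pairs from frames and narration,
    e.g. by prompting an MLLM with sampled frames plus the transcript."""
    return []

def answerable_from_text(qa, transcript):
    """Placeholder 'blind test': True if a text-only model can answer the
    question from the transcript alone, i.e. no visual grounding needed."""
    return False

def expert_approves(qa):
    """Placeholder: multi-disciplinary PhD-level expert verification."""
    return True

def build_annotations(video, transcript):
    accepted = []
    for qa in generate_candidates(video, transcript):
        # Discard questions answerable without watching the video, so the
        # surviving tasks genuinely require visual grounding.
        if answerable_from_text(qa, transcript):
            continue
        # Keep only items that domain experts verify as correct.
        if expert_approves(qa):
            accepted.append(qa)
    return accepted
```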
Comprehensive evaluation revealing MLLM limitations in experimental settings

The authors evaluate 19 state-of-the-art multimodal models on ExpVid, revealing that while models excel at coarse recognition, they struggle with fine-grained visual disambiguation, temporal state tracking, and connecting experimental procedures to scientific conclusions. The evaluation also exposes a notable gap between proprietary and open-source models, particularly in higher-order reasoning.

10 retrieved papers (none refutable)
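
A simple way to surface the reported pattern (strong coarse recognition, weak higher-order reasoning) is to break accuracy out by hierarchy level. The sketch below assumes a flat list of result records with "level", "answer", and "prediction" fields; this format is an assumption, not the benchmark's release schema.

```python
# Minimal sketch of how the reported failure pattern could be quantified:
# accuracy broken out by hierarchy level. The record format ("level",
# "answer", "prediction") is an assumption, not the benchmark's schema.
from collections import defaultdict

def accuracy_by_level(records):
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["level"]] += 1
        correct[r["level"]] += r["prediction"] == r["answer"]
    return {lvl: correct[lvl] / total[lvl] for lvl in total}

# Toy illustration: a model that nails perception but misses reasoning.
records = [
    {"level": "Fine-grained Perception", "answer": "A", "prediction": "A"},
    {"level": "Procedural Understanding", "answer": "B", "prediction": "B"},
    {"level": "Scientific Reasoning", "answer": "C", "prediction": "D"},
]
print(accuracy_by_level(records))
# -> {'Fine-grained Perception': 1.0, 'Procedural Understanding': 1.0,
#     'Scientific Reasoning': 0.0}
```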
