MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Research Landscape Overview
Claimed Contributions
The authors present MMSI-Bench, a comprehensive benchmark containing 1,000 human-curated multiple-choice questions designed to evaluate multimodal large language models on spatial reasoning tasks that require integrating information across multiple images. The benchmark covers ten fundamental spatial reasoning task types plus multi-step reasoning, spanning diverse real-world scenarios.
The authors develop a fully human-centric construction process in which expert researchers manually select image sets, write novel, challenging questions that cannot be answered from any single image, and provide detailed step-by-step reasoning annotations. This methodology yields higher quality, diversity, and difficulty than template-based approaches.
The authors introduce an automated analysis method that uses the human-annotated reasoning processes to systematically categorize model failures into four error types: grounding errors, overlap-matching and scene-reconstruction errors, situation-transformation reasoning errors, and spatial-logic errors. This enables scalable diagnosis of spatial reasoning capabilities.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[8] NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
[9] Space-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
[35] Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes
[49] Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
Contribution Analysis
Detailed comparisons for each claimed contribution
MMSI-Bench benchmark for multi-image spatial intelligence
The authors present MMSI-Bench, a comprehensive benchmark containing 1,000 human-curated multiple-choice questions designed to evaluate multimodal large language models on spatial reasoning tasks that require integrating information across multiple images. The benchmark covers ten fundamental spatial reasoning task types plus multi-step reasoning, spanning diverse real-world scenarios.
[8] NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
[10] MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
[23] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
[51] MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
[52] Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
[56] MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems
[57] MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning
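To make the comparison concrete, below is a minimal sketch of how a multi-image, multiple-choice evaluation in the spirit of MMSI-Bench could be scored. The item schema (`MultiImageItem`), its field names, and the letter-extraction heuristic are illustrative assumptions, not the benchmark's released format or official evaluation code.

```python
from dataclasses import dataclass
import re


@dataclass
class MultiImageItem:
    """One multiple-choice question grounded in several images (illustrative schema)."""
    question: str
    image_paths: list[str]   # two or more images per question
    choices: dict[str, str]  # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str              # gold option letter
    task_type: str           # one of the ten spatial task types or "multi-step"
    reasoning: str = ""      # human-written step-by-step reasoning annotation


def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-D) out of a free-form model reply."""
    match = re.search(r"\b([ABCD])\b", response.upper())
    return match.group(1) if match else None


def evaluate(items: list[MultiImageItem], ask_model) -> float:
    """Compute multiple-choice accuracy.

    `ask_model(image_paths, prompt) -> str` is any callable that queries an MLLM
    with all of an item's images plus the question text and returns its reply.
    """
    correct = 0
    for item in items:
        options = "\n".join(f"{letter}. {text}" for letter, text in item.choices.items())
        prompt = f"{item.question}\n{options}\nAnswer with a single option letter."
        prediction = extract_choice(ask_model(item.image_paths, prompt))
        correct += int(prediction == item.answer)
    return correct / len(items)
```

Assuming four options per question as in this sketch, random guessing gives roughly 25% accuracy, which is the natural floor against which reported model scores are read.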
Human-centric benchmark construction methodology
The authors develop a fully human-centric construction process in which expert researchers manually select image sets, write novel, challenging questions that cannot be answered from any single image, and provide detailed step-by-step reasoning annotations. This methodology yields higher quality, diversity, and difficulty than template-based approaches.
[58] MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
[59] MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
[60] MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
[61] ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
[63] MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs
[64] MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
[65] EthicsMH: A Pilot Benchmark for Ethical Reasoning in Mental Health AI
[66] Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models
[67] MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence
Automated error analysis pipeline using annotated reasoning
The authors introduce an automated analysis method that uses the human-annotated reasoning processes to systematically categorize model failures into four error types: grounding errors, overlap-matching and scene-reconstruction errors, situation-transformation reasoning errors, and spatial-logic errors. This enables scalable diagnosis of spatial reasoning capabilities.
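As a rough illustration of how annotated reasoning can drive automated failure categorization, the sketch below uses an LLM judge to assign each incorrect response to one of the four error types. The prompt wording, the `judge_llm` callable, and the fallback logic are hypothetical assumptions, not the authors' released analysis pipeline.

```python
# Hypothetical sketch of the automated error-type classification step; the judge
# prompt, category names, and `judge_llm` interface are illustrative assumptions.

ERROR_TYPES = [
    "grounding error",
    "overlap-matching / scene-reconstruction error",
    "situation-transformation reasoning error",
    "spatial-logic error",
]

JUDGE_TEMPLATE = """You are auditing a model's spatial-reasoning answer.
Question: {question}
Human-annotated reference reasoning: {reference_reasoning}
Model's reasoning and answer: {model_output}

The model's final answer is wrong. Identify the first step at which its reasoning
diverges from the reference and classify the failure as exactly one of:
{error_types}
Reply with the category name only."""


def classify_failure(question: str, reference_reasoning: str,
                     model_output: str, judge_llm) -> str:
    """Ask a judge LLM to map one incorrect response to an error category.

    `judge_llm(prompt) -> str` is any callable wrapping a strong text-only LLM.
    """
    prompt = JUDGE_TEMPLATE.format(
        question=question,
        reference_reasoning=reference_reasoning,
        model_output=model_output,
        error_types="\n".join(f"- {e}" for e in ERROR_TYPES),
    )
    verdict = judge_llm(prompt).strip().lower()
    for error_type in ERROR_TYPES:
        # Match on the leading keyword (e.g. "grounding", "spatial-logic") so minor
        # formatting differences in the judge's reply are tolerated.
        if error_type.split(" ")[0] in verdict:
            return error_type
    return "unclassified"  # judge replied off-format
```

Aggregating the returned categories over all failed questions yields a per-model error profile that can be compared across the four types.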