SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Multimodal Large Language Models, Spatial Reasoning
Abstract:

Humans can imagine and manipulate visual images mentally, a capability known as spatial visualization. While many multi-modal benchmarks assess reasoning over visible visual information, the ability to infer unseen relationships through spatial visualization remains insufficiently evaluated. Moreover, existing benchmarks often rely on publicly sourced problems from IQ tests or math competitions, which risks data contamination and compromises assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 programmatically generated problems within a scalable framework that can be expanded to ensure fair and continually reliable evaluation. Our evaluation of 27 Multi-modal Large Language Models (MLLMs) reveals wide performance variation, demonstrates the benchmark's strong discriminative power, and uncovers a counter-intuitive finding: Chain-of-Thought (CoT) prompting paradoxically degrades accuracy on open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench shows that state-of-the-art MLLMs exhibit clear deficiencies in spatial visualization, addressing a significant gap in the field.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SpatialViz-Bench, a benchmark targeting spatial visualization—the mental manipulation of visual imagery—across twelve tasks organized into four sub-abilities. It resides in the 'Spatial Visualization and Mental Manipulation' leaf of the taxonomy, which contains only three papers total, including this work. This is a relatively sparse research direction compared to neighboring leaves like 'General Spatial Reasoning Evaluation' (six papers) or '3D Scene and Spatial Understanding' (five papers), suggesting the specific focus on mental imagery and unseen spatial transformations is less crowded than broader spatial reasoning evaluation.

The taxonomy reveals that SpatialViz-Bench sits within the broader 'Spatial Reasoning Benchmarks and Evaluation' branch, which encompasses five distinct evaluation paradigms. Neighboring leaves address observable spatial relationships (General Spatial Reasoning), depth-based comprehension (3D Scene Understanding), and multi-view reasoning (Multi-Image Spatial Reasoning). The scope note for this leaf explicitly excludes 'observable spatial relationships or general reasoning,' positioning the work at the boundary between perceptual spatial tasks and higher-order mental transformation. This placement suggests the paper targets a capability gap between static visual understanding and dynamic spatial manipulation that sibling categories do not directly address.

Among the thirty candidates examined, none clearly refuted any of the three contributions. Ten candidates were compared against the benchmark contribution with zero refutable matches, and the same held for the programmatic generation methodology and for the systematic evaluation of twenty-seven models. Given the limited search scope, this absence of overlapping prior work across all contributions suggests that the specific combination of spatial visualization focus, programmatic scalability, and comprehensive model diagnostics may be relatively unexplored. However, the search examined only top-K semantic matches and citations rather than the full literature, so undetected overlaps remain possible.

Based on the limited thirty-candidate search, the work appears to occupy a distinct position within spatial reasoning evaluation, particularly in its emphasis on mental manipulation rather than observable relationships. The sparse population of its taxonomy leaf and the absence of refuting candidates across all contributions suggest novelty, though the restricted search scope means this assessment reflects only the most semantically similar prior work. The counter-intuitive finding about Chain-of-Thought degradation on open-source models hints at diagnostic value beyond benchmark construction itself.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: spatial visualization in multi-modal large language models. The field has organized itself around several complementary directions. Spatial Reasoning Benchmarks and Evaluation focuses on designing tasks that probe models' abilities to understand and manipulate spatial information, ranging from mental rotation and visualization challenges to grounded spatial understanding in images and 3D scenes. Methods for Enhancing Spatial Reasoning explores architectural innovations, training strategies, and augmentation techniques that improve spatial capabilities, while Analysis and Limitations of Spatial Capabilities investigates where and why current models fail. Domain-Specific Spatial Applications adapts spatial reasoning to specialized contexts such as robotics, navigation, and medical imaging, and Surveys and Comprehensive Reviews synthesize progress across these areas. Auxiliary Techniques and Optimization addresses broader infrastructure concerns, such as efficiency and token compression, that enable practical deployment of spatially aware models.

Within the benchmark landscape, a particularly active thread examines spatial visualization and mental manipulation: tasks requiring models to imagine transformations or reason about unseen perspectives. SpatialViz-Bench[0] situates itself in this cluster, emphasizing rigorous evaluation of visualization capabilities that go beyond static spatial relationships. Nearby works like Thinking in Space[1] and STAR-R1[11] similarly probe mental rotation and dynamic spatial reasoning, though they may differ in task granularity or the types of transformations tested. Earlier efforts such as Visual Spatial Reasoning[3] laid foundational evaluation paradigms, while recent benchmarks like VisuLogic[2] and VisFactor[4] have expanded the scope to logical spatial inference and factor-based decomposition.

A central open question across these studies is whether models genuinely perform internal spatial transformations or rely on pattern matching, and how evaluation design can distinguish these mechanisms. SpatialViz-Bench[0] contributes to this conversation by targeting specific visualization competencies that reveal deeper spatial understanding.

Claimed Contributions

SpatialViz-Bench: A comprehensive benchmark for spatial visualization

The authors present SpatialViz-Bench, a novel benchmark designed to formally evaluate spatial visualization capabilities of MLLMs. It is grounded in cognitive science and assesses 4 key sub-abilities through 12 distinct tasks, resulting in 1,180 examples across parameter-controlled difficulty levels.

10 retrieved papers
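As a reading aid, the stated composition (4 sub-abilities, 12 tasks, 1,180 parameter-controlled problems) can be captured in a schema like the minimal Python sketch below. The field names and the integrity check are illustrative assumptions; the report does not specify the authors' actual data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Problem:
    # Hypothetical record layout: the report gives the counts
    # (4 sub-abilities, 12 tasks, 1,180 problems) but not the schema.
    task: str                 # one of the 12 tasks
    sub_ability: str          # one of the 4 sub-abilities
    difficulty: int           # parameter-controlled difficulty level
    question: str
    choices: tuple[str, ...]  # answer options shown to the model
    answer: str               # ground-truth option label, e.g. "B"

def check_inventory(problems: list[Problem]) -> None:
    """Sanity-check a loaded benchmark against the stated counts."""
    assert len({p.sub_ability for p in problems}) == 4
    assert len({p.task for p in problems}) == 12
    assert len(problems) == 1180
```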
Scalable programmatic generation methodology

The authors develop a programmatic generation pipeline integrating Python with FreeCAD to create novel test cases. This methodology allows scalable task expansion and prevents data contamination by dynamically updating the test bank through randomized generation.

10 retrieved papers
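The report describes this pipeline only at a high level (Python driving FreeCAD, with randomized generation refreshing the test bank), so the following is a minimal sketch under those assumptions. The block-stacking geometry, grid snapping, and STL export are illustrative choices, not the authors' actual code; it is meant to run inside FreeCAD's headless interpreter.

```python
# Hypothetical sketch, e.g. run as: freecadcmd generate.py
import random

import FreeCAD  # provided by the FreeCAD runtime, not pip
import Mesh     # FreeCAD's mesh module, used here for STL export

def generate_case(seed: int, difficulty: int, out_path: str) -> None:
    """Emit one randomized block-assembly stimulus as an STL file.

    A fresh seed yields a fresh case, so the test bank can be regenerated
    at will; this is the contamination-resistance property described above.
    """
    rng = random.Random(seed)                # deterministic per seed
    doc = FreeCAD.newDocument(f"case_{seed}")
    for i in range(3 + difficulty):          # difficulty controls scene size
        box = doc.addObject("Part::Box", f"Block{i}")
        box.Length = box.Width = box.Height = 10.0
        # Snap each block to a 10 mm grid so assemblies stay well-formed.
        box.Placement = FreeCAD.Placement(
            FreeCAD.Vector(10 * rng.randint(0, 2),
                           10 * rng.randint(0, 2),
                           10 * rng.randint(0, 2)),
            FreeCAD.Rotation(0, 0, 0),
        )
    doc.recompute()
    Mesh.export(list(doc.Objects), out_path)  # tessellates the parts to STL
    FreeCAD.closeDocument(doc.Name)
```

Rendering the exported geometry into question images and generating distractor options would be further steps that the report does not detail.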
Systematic evaluation and diagnostic analysis of 27 MLLMs

The authors conduct a comprehensive evaluation of 27 MLLMs on SpatialViz-Bench, demonstrating the benchmark's challenge and discriminative power. Their diagnostic analysis identifies that failures primarily arise from perceptual and spatial transformation deficits rather than high-level reasoning issues.

10 retrieved papers
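The diagnostic claim (failures stem mainly from perceptual and spatial-transformation deficits rather than high-level reasoning) implies tallying annotated error types alongside accuracy. Below is a minimal sketch of that aggregation; the three-way error taxonomy and the record format are assumptions mirroring the report's wording, not the authors' published protocol.

```python
from collections import Counter

# Assumed error taxonomy, mirroring the report's wording; the paper's
# actual annotation categories may differ.
ERROR_TYPES = ("perception", "spatial_transformation", "high_level_reasoning")

def diagnose(records):
    """Aggregate per-model accuracy and an error-type breakdown.

    records: dicts like {"model": str, "correct": bool, "error_type": str | None},
    where error_type is annotated (by humans or a rubric) for wrong answers.
    """
    totals, correct, errors = Counter(), Counter(), Counter()
    for r in records:
        totals[r["model"]] += 1
        correct[r["model"]] += int(r["correct"])
        if not r["correct"] and r["error_type"] in ERROR_TYPES:
            errors[(r["model"], r["error_type"])] += 1
    accuracy = {m: correct[m] / totals[m] for m in totals}
    return accuracy, dict(errors)
```

Per-model accuracy supports the discriminative-power claim, while the error breakdown is what would let an analysis attribute failures to perception or transformation rather than reasoning.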

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: SpatialViz-Bench: A comprehensive benchmark for spatial visualization

Contribution 2: Scalable programmatic generation methodology

Contribution 3: Systematic evaluation and diagnostic analysis of 27 MLLMs

Each contribution is summarized under Claimed Contributions above; no refutable prior work was identified among the retrieved candidates for any of the three.