SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Multimodal Large Language Models, Spatial Reasoning
Abstract:

Humans can imagine and manipulate visual images mentally, a capability known as spatial visualization. While many multi-modal benchmarks assess reasoning over visible visual information, the ability to infer unseen relationships through spatial visualization remains insufficiently evaluated. Moreover, existing benchmarks often rely on publicly sourced problems from IQ tests or math competitions, which risks data contamination and compromises assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 programmatically generated problems within a scalable framework that can be expanded to ensure fair and continually reliable evaluation. Our evaluation of 27 Multi-modal Large Language Models (MLLMs) reveals wide performance variation, demonstrates the benchmark's strong discriminative power, and uncovers a counter-intuitive finding: Chain-of-Thought (CoT) prompting paradoxically degrades accuracy on open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench shows that state-of-the-art MLLMs exhibit clear deficiencies in spatial visualization, addressing a significant gap in the field.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SpatialViz-Bench, a benchmark targeting spatial visualization—the mental manipulation of visual imagery—across twelve tasks organized into four sub-abilities. It resides in the 'Spatial Visualization and Mental Manipulation' leaf of the taxonomy, which contains only three papers total, including this work. This is a relatively sparse research direction compared to neighboring leaves like 'General Spatial Reasoning Evaluation' (six papers) or '3D Scene and Spatial Understanding' (five papers), suggesting the specific focus on mental imagery and unseen spatial transformations is less crowded than broader spatial reasoning evaluation.

The taxonomy reveals that SpatialViz-Bench sits within the broader 'Spatial Reasoning Benchmarks and Evaluation' branch, which encompasses five distinct evaluation paradigms. Neighboring leaves address observable spatial relationships (General Spatial Reasoning), depth-based comprehension (3D Scene Understanding), and multi-view reasoning (Multi-Image Spatial Reasoning). The scope note for this leaf explicitly excludes 'observable spatial relationships or general reasoning,' positioning the work at the boundary between perceptual spatial tasks and higher-order mental transformation. This placement suggests the paper targets a capability gap between static visual understanding and dynamic spatial manipulation that sibling categories do not directly address.

Among the thirty candidates examined, none clearly refuted any of the three contributions. Ten candidates were compared against the benchmark contribution with zero refutable matches, and the same held for the programmatic generation methodology and for the systematic evaluation of twenty-seven models. Given the limited search scope, this absence of overlapping prior work across all contributions suggests that the specific combination of spatial visualization focus, programmatic scalability, and comprehensive model diagnostics may be relatively unexplored. However, the search examined only top-K semantic matches and citations rather than the full literature, so undetected overlaps remain possible.

Based on the limited thirty-candidate search, the work appears to occupy a distinct position within spatial reasoning evaluation, particularly in its emphasis on mental manipulation rather than observable relationships. The sparse population of its taxonomy leaf and the absence of refuting candidates across all contributions suggest novelty, though the restricted search scope means this assessment reflects only the most semantically similar prior work. The counter-intuitive finding about Chain-of-Thought degradation on open-source models hints at diagnostic value beyond benchmark construction itself.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: spatial visualization in multi-modal large language models. The field has organized itself around several complementary directions. Spatial Reasoning Benchmarks and Evaluation focuses on designing tasks that probe models' abilities to understand and manipulate spatial information, ranging from mental rotation and visualization challenges to grounded spatial understanding in images and 3D scenes. Methods for Enhancing Spatial Reasoning explores architectural innovations, training strategies, and augmentation techniques that improve spatial capabilities, while Analysis and Limitations of Spatial Capabilities investigates where and why current models fail. Domain-Specific Spatial Applications adapts spatial reasoning to specialized contexts such as robotics, navigation, and medical imaging, and Surveys and Comprehensive Reviews synthesize progress across these areas. Auxiliary Techniques and Optimization addresses broader infrastructure concerns, such as efficiency and token compression, that enable practical deployment of spatially aware models.

Within the benchmark landscape, a particularly active thread examines spatial visualization and mental manipulation: tasks requiring models to imagine transformations or reason about unseen perspectives. SpatialViz-Bench[0] situates itself in this cluster, emphasizing rigorous evaluation of visualization capabilities that go beyond static spatial relationships. Nearby works like Thinking in Space[1] and STAR-R1[11] similarly probe mental rotation and dynamic spatial reasoning, though they may differ in task granularity or the types of transformations tested. Earlier efforts such as Visual Spatial Reasoning[3] laid foundational evaluation paradigms, while recent benchmarks like VisuLogic[2] and VisFactor[4] have expanded the scope to logical spatial inference and factor-based decomposition.

A central open question across these studies is whether models genuinely perform internal spatial transformations or rely on pattern matching, and how evaluation design can distinguish these mechanisms. SpatialViz-Bench[0] contributes to this conversation by targeting specific visualization competencies that reveal deeper spatial understanding.

Claimed Contributions

SpatialViz-Bench: A comprehensive benchmark for spatial visualization

The authors present SpatialViz-Bench, a novel benchmark designed to formally evaluate spatial visualization capabilities of MLLMs. It is grounded in cognitive science and assesses 4 key sub-abilities through 12 distinct tasks, resulting in 1,180 examples across parameter-controlled difficulty levels.

10 retrieved papers
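As a reading aid, the stated composition (4 sub-abilities, 12 tasks, 1,180 parameter-controlled problems) can be captured in a schema like the minimal Python sketch below. The field names and the integrity check are illustrative assumptions; the report does not specify the authors' actual data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Problem:
    # Hypothetical record layout: the report gives the counts
    # (4 sub-abilities, 12 tasks, 1,180 problems) but not the schema.
    task: str                 # one of the 12 tasks
    sub_ability: str          # one of the 4 sub-abilities
    difficulty: int           # parameter-controlled difficulty level
    question: str
    choices: tuple[str, ...]  # answer options shown to the model
    answer: str               # ground-truth option label, e.g. "B"

def check_inventory(problems: list[Problem]) -> None:
    """Sanity-check a loaded benchmark against the stated counts."""
    assert len({p.sub_ability for p in problems}) == 4
    assert len({p.task for p in problems}) == 12
    assert len(problems) == 1180
```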
Scalable programmatic generation methodology

The authors develop a programmatic generation pipeline integrating Python with FreeCAD to create novel test cases. This methodology allows scalable task expansion and prevents data contamination by dynamically updating the test bank through randomized generation.

10 retrieved papers
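The report describes this pipeline only at a high level (Python driving FreeCAD, with randomized generation refreshing the test bank), so the following is a minimal sketch under those assumptions. The block-stacking geometry, grid snapping, and STL export are illustrative choices, not the authors' actual code; it is meant to run inside FreeCAD's headless interpreter.

```python
# Hypothetical sketch, e.g. run as: freecadcmd generate.py
import random

import FreeCAD  # provided by the FreeCAD runtime, not pip
import Mesh     # FreeCAD's mesh module, used here for STL export

def generate_case(seed: int, difficulty: int, out_path: str) -> None:
    """Emit one randomized block-assembly stimulus as an STL file.

    A fresh seed yields a fresh case, so the test bank can be regenerated
    at will; this is the contamination-resistance property described above.
    """
    rng = random.Random(seed)                # deterministic per seed
    doc = FreeCAD.newDocument(f"case_{seed}")
    for i in range(3 + difficulty):          # difficulty controls scene size
        box = doc.addObject("Part::Box", f"Block{i}")
        box.Length = box.Width = box.Height = 10.0
        # Snap each block to a 10 mm grid so assemblies stay well-formed.
        box.Placement = FreeCAD.Placement(
            FreeCAD.Vector(10 * rng.randint(0, 2),
                           10 * rng.randint(0, 2),
                           10 * rng.randint(0, 2)),
            FreeCAD.Rotation(0, 0, 0),
        )
    doc.recompute()
    Mesh.export(list(doc.Objects), out_path)  # tessellates the parts to STL
    FreeCAD.closeDocument(doc.Name)
```

Rendering the exported geometry into question images and generating distractor options would be further steps that the report does not detail.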
Systematic evaluation and diagnostic analysis of 27 MLLMs

The authors conduct a comprehensive evaluation of 27 MLLMs on SpatialViz-Bench, demonstrating the benchmark's challenge and discriminative power. Their diagnostic analysis identifies that failures primarily arise from perceptual and spatial transformation deficits rather than high-level reasoning issues.

10 retrieved papers
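The diagnostic claim (failures stem mainly from perceptual and spatial-transformation deficits rather than high-level reasoning) implies tallying annotated error types alongside accuracy. Below is a minimal sketch of that aggregation; the three-way error taxonomy and the record format are assumptions mirroring the report's wording, not the authors' published protocol.

```python
from collections import Counter

# Assumed error taxonomy, mirroring the report's wording; the paper's
# actual annotation categories may differ.
ERROR_TYPES = ("perception", "spatial_transformation", "high_level_reasoning")

def diagnose(records):
    """Aggregate per-model accuracy and an error-type breakdown.

    records: dicts like {"model": str, "correct": bool, "error_type": str | None},
    where error_type is annotated (by humans or a rubric) for wrong answers.
    """
    totals, correct, errors = Counter(), Counter(), Counter()
    for r in records:
        totals[r["model"]] += 1
        correct[r["model"]] += int(r["correct"])
        if not r["correct"] and r["error_type"] in ERROR_TYPES:
            errors[(r["model"], r["error_type"])] += 1
    accuracy = {m: correct[m] / totals[m] for m in totals}
    return accuracy, dict(errors)
```

Per-model accuracy supports the discriminative-power claim, while the error breakdown is what would let an analysis attribute failures to perception or transformation rather than reasoning.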

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: SpatialViz-Bench: A comprehensive benchmark for spatial visualization

Contribution 2: Scalable programmatic generation methodology

Contribution 3: Systematic evaluation and diagnostic analysis of 27 MLLMs

Each contribution is summarized under Claimed Contributions above; no refutable prior work was identified among the retrieved candidates for any of the three.