SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: spatial reasoning · VLMs · benchmark
Abstract:

We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Because perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, grounding relative positions, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. These categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time correlates strongly with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared by humans and VLMs. Together, our findings highlight the need for structured, cognitively inspired diagnostic tools to advance spatial reasoning in multimodal foundation models. Our website can be found here.
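To make the evaluation protocol described above concrete, the following is a minimal sketch of how per-category accuracy might be computed over SpinBench-style multiple-choice items. The item tuples and the query_vlm() stub are assumptions for illustration, not the authors' released schema or harness.

```python
# A minimal sketch, assuming a simplified item format and a placeholder VLM call;
# the released benchmark's schema and evaluation harness may differ.
from collections import defaultdict
from statistics import mean

# (category, question, choices, correct_index, mean_human_response_time_s)
items = [
    ("rotation",
     "After a 90-degree clockwise rotation of the camera, which view matches the scene?",
     ["A", "B", "C", "D"], 2, 4.1),
    ("perspective_taking",
     "From the chair's viewpoint, is the mug to the left or right of the lamp?",
     ["left", "right"], 0, 6.8),
]

def query_vlm(question, choices):
    """Placeholder for a real VLM call (API or local model); returns a choice index."""
    return 0  # stub: always picks the first option

hits = defaultdict(list)
for category, question, choices, answer, human_rt in items:
    prediction = query_vlm(question, choices)
    hits[category].append(int(prediction == answer))

# Per-category accuracy, as reported in the diagnostic breakdown.
accuracy = {category: mean(h) for category, h in hits.items()}
print(accuracy)

# The paper's difficulty analysis correlates per-item human response time with model
# correctness; with real predictions one could compute, e.g.,
# statistics.correlation(response_times, correctness) over all items.
```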

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SpinBench introduces a cognitively grounded diagnostic benchmark targeting perspective-taking and viewpoint transformation in vision-language models. The taxonomy places this work in a dedicated leaf node under Spatial Reasoning Evaluation and Benchmarking, specifically within Perspective-Taking and Viewpoint Transformation. Notably, this leaf contains only one paper—SpinBench itself—indicating that perspective-taking as a focused evaluation target remains relatively underexplored compared to broader spatial relationship benchmarks or egocentric spatial tasks, which populate neighboring leaves with multiple entries.

The taxonomy reveals that SpinBench occupies a specialized niche within a broader evaluation landscape. Adjacent leaves include General Spatial Relationship and Reasoning Benchmarks (three papers assessing diverse spatial relations), Egocentric and Embodied Spatial Benchmarks (two papers on agent-centric viewpoints), and Occlusion and Pattern-Based Spatial Reasoning (one paper on occluded objects). While these neighboring categories address static spatial relations or agent-specific perspectives, SpinBench's emphasis on viewpoint transformation and rotational understanding distinguishes it from general spatial assessments and egocentric tasks that do not systematically vary observer position.

Among thirty candidates examined through semantic search and citation expansion, none clearly refuted any of SpinBench's three core contributions: the cognitively grounded diagnostic benchmark itself, the controlled variation framework for diagnostic evaluation, and the multi-domain dataset combining real-world and synthetic data. Each contribution was assessed against ten candidates, with zero refutable overlaps identified. This suggests that within the limited search scope, no prior work directly provides the same combination of perspective-taking diagnostics, fine-grained cognitive scaffolding, and multi-domain coverage that SpinBench offers.

The analysis reflects a targeted literature search rather than an exhaustive survey of all spatial reasoning benchmarks. The absence of sibling papers in the same taxonomy leaf and the zero-refutation outcome across thirty candidates indicate that SpinBench addresses a gap in perspective-taking evaluation, though the search scope does not capture every possible related benchmark or dataset. The work's positioning in a sparse leaf suggests it pioneers a specific diagnostic direction within the broader spatial reasoning evaluation landscape.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: spatial reasoning in vision language models. The field has organized itself around several complementary branches that address different facets of how multimodal systems understand and manipulate spatial information. Spatial Representation Enhancement Methods focus on enriching the internal representations that models use to encode geometric and positional cues, while Spatial Reasoning Training and Data Curation emphasizes the creation of specialized datasets and training regimes to improve spatial competence. Spatial Reasoning Mechanisms and Inference Methods explore algorithmic strategies, such as chain-of-thought prompting or intermediate visual reasoning steps, that help models perform multi-step spatial inferences. Spatial Reasoning Evaluation and Benchmarking develops diagnostic tests and metrics to measure capabilities like perspective-taking, viewpoint transformation, and relational understanding, with works such as SpinBench[0] and EmbSpatial-Bench[45] providing targeted assessments. Domain-Specific Spatial Reasoning Applications adapt these techniques to robotics, navigation, and embodied AI, while Multimodal Alignment and Representation Learning and General Vision-Language Model Capabilities and Surveys address broader architectural and alignment questions that underpin spatial understanding.

Recent efforts reveal a tension between specialized spatial modules and end-to-end learned representations. Some lines of work, including SpatialRGPT[1] and SpatialBot[5], introduce explicit spatial encodings or auxiliary reasoning pathways to handle complex geometric queries, whereas others like Visual Cognition[4] and Multimodal LLM Survey[3] examine how general-purpose vision-language models can be fine-tuned or prompted to exhibit spatial competence without heavy architectural changes.

SpinBench[0] sits squarely within the evaluation branch focused on perspective-taking and viewpoint transformation, providing a benchmark that complements broader diagnostic suites like MMIU[26] and aligns closely with works such as Egocentric Spatial Reasoning[8] that probe how models handle observer-relative spatial relations. By systematically testing viewpoint shifts, SpinBench[0] highlights gaps that training-focused approaches like SpatialVLM[15] and data-centric methods aim to address, underscoring the ongoing challenge of building models that robustly generalize across diverse spatial contexts.

Claimed Contributions

SpinBench: A cognitively grounded diagnostic benchmark for spatial reasoning in VLMs

The authors introduce SpinBench, a benchmark designed around perspective taking that decomposes spatial reasoning into fine-grained diagnostic categories including translation, rotation, object relative pose, and viewpoint change. The tasks are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting (a hypothetical sketch of this scaffolding follows below).

Candidate papers retrieved for comparison: 10
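The progression from single-object to multi-object perspective-taking tasks could be represented as a simple ordered structure. The sketch below is hypothetical: the category names mirror the diagnostic axes listed above, but the difficulty attributes and ordering rule are assumptions, not the paper's exact task list.

```python
# A minimal sketch, assuming hypothetical difficulty attributes per category;
# the paper's exact tasks and ordering may differ.
from dataclasses import dataclass

@dataclass(frozen=True)
class DiagnosticCategory:
    name: str
    num_objects: int       # how many objects the task reasons about
    viewpoint_shift: bool  # whether the observer's viewpoint changes

CATEGORIES = [
    DiagnosticCategory("translation", num_objects=1, viewpoint_shift=False),
    DiagnosticCategory("rotation", num_objects=1, viewpoint_shift=False),
    DiagnosticCategory("object_relative_pose", num_objects=2, viewpoint_shift=False),
    DiagnosticCategory("viewpoint_change", num_objects=1, viewpoint_shift=True),
    DiagnosticCategory("perspective_taking", num_objects=2, viewpoint_shift=True),
]

def scaffold_order(categories):
    """Order tasks from simplest (single object, fixed viewpoint) to hardest."""
    return sorted(categories, key=lambda c: (c.viewpoint_shift, c.num_objects))

for category in scaffold_order(CATEGORIES):
    print(category.name)
```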
Controlled variation framework for diagnostic evaluation

The authors develop a systematic framework of controlled variations including reference frame manipulations (allocentric vs. egocentric), premise-based question structures, and syntactic and symmetrical augmentations. This framework enables precise diagnosis of model failures and biases in spatial reasoning; a toy example of such a variation is sketched below.

Candidate papers retrieved for comparison: 10
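As a toy illustration of one such controlled variation: a symmetrical reformulation swaps the two objects and inverts the spatial relation, so a consistent model should give the same yes/no answer to both phrasings. The relation vocabulary and question templates below are assumptions for illustration, not the benchmark's actual templates.

```python
# A toy sketch of the symmetrical/syntactic augmentation idea, assuming simple
# left/right and front/behind relations; the real templates are richer.
OPPOSITE = {
    "to the left of": "to the right of",
    "to the right of": "to the left of",
    "in front of": "behind",
    "behind": "in front of",
}

def question(obj_a, obj_b, relation, frame="egocentric"):
    prefix = ("From your (the camera's) viewpoint"
              if frame == "egocentric"
              else "From a viewer standing on the opposite side of the scene")
    return f"{prefix}, is the {obj_a} {relation} the {obj_b}?"

def symmetric_variant(obj_a, obj_b, relation, frame="egocentric"):
    """Swap the two objects and invert the relation: the ground truth is unchanged."""
    return question(obj_b, obj_a, OPPOSITE[relation], frame)

q_original = question("mug", "laptop", "to the left of")
q_symmetric = symmetric_variant("mug", "laptop", "to the left of")

def consistent(answer_original: bool, answer_symmetric: bool) -> bool:
    """Both phrasings describe the same configuration, so yes/no answers should agree."""
    return answer_original == answer_symmetric

print(q_original)
print(q_symmetric)
```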
Multi-domain dataset combining real-world and synthetic data

The authors construct a benchmark dataset spanning four visual domains (Infinigen synthetic scenes, ABO household objects, cars, and human faces) with 51 tasks totaling 2.7k samples. The dataset ensures domain diversity and real-world relevance while maintaining evaluation rigor through controlled generation and annotation (a hypothetical manifest sketch follows below).

Candidate papers retrieved for comparison: 10
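As a rough illustration of how such a multi-domain dataset might be organized, the sketch below assumes a hypothetical per-sample manifest and checks that every sample falls into one of the four stated domains. The field names, file paths, and example records are illustrative only, not the released schema.

```python
# A minimal sketch, assuming a hypothetical manifest layout for the four domains.
from collections import Counter

DOMAINS = {"infinigen", "abo_objects", "cars", "faces"}

manifest = [
    {"domain": "infinigen", "task": "viewpoint_change_01", "image": "scenes/0001.png", "answer": "B"},
    {"domain": "faces", "task": "rotation_03", "image": "faces/0042.png", "answer": "left"},
    # ... the full benchmark would list 51 tasks and roughly 2.7k samples
]

assert all(sample["domain"] in DOMAINS for sample in manifest), "unknown visual domain"
samples_per_domain = Counter(sample["domain"] for sample in manifest)
num_tasks = len({sample["task"] for sample in manifest})
print(samples_per_domain, num_tasks, len(manifest))
```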

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though the signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SpinBench: A cognitively grounded diagnostic benchmark for spatial reasoning in VLMs

The authors introduce SpinBench, a benchmark designed around perspective taking that decomposes spatial reasoning into fine-grained diagnostic categories including translation, rotation, object relative pose, and viewpoint change. The tasks are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting.

Contribution

Controlled variation framework for diagnostic evaluation

The authors develop a systematic framework of controlled variations including reference frame manipulations (allocentric vs. egocentric), premise-based question structures, and syntactic and symmetrical augmentations. This framework enables precise diagnosis of model failures and biases in spatial reasoning.

Contribution

Multi-domain dataset combining real-world and synthetic data

The authors construct a benchmark dataset spanning four visual domains (Infinigen synthetic scenes, ABO household objects, cars, and human faces) with 51 tasks totaling 2.7k samples. The dataset ensures domain diversity and real-world relevance while maintaining evaluation rigor through controlled generation and annotation.