SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: spatial reasoning · VLMs · benchmark
Abstract:

We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Because perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, grounding relative positions, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. These categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time correlates strongly with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared by humans and VLMs. Together, our findings highlight the need for structured, cognitively inspired diagnostic tools to advance spatial reasoning in multimodal foundation models. Our website can be found here.
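To make the evaluation protocol described above concrete, the following is a minimal sketch of how per-category accuracy might be computed over SpinBench-style multiple-choice items. The item tuples and the query_vlm() stub are assumptions for illustration, not the authors' released schema or harness.

```python
# A minimal sketch, assuming a simplified item format and a placeholder VLM call;
# the released benchmark's schema and evaluation harness may differ.
from collections import defaultdict
from statistics import mean

# (category, question, choices, correct_index, mean_human_response_time_s)
items = [
    ("rotation",
     "After a 90-degree clockwise rotation of the camera, which view matches the scene?",
     ["A", "B", "C", "D"], 2, 4.1),
    ("perspective_taking",
     "From the chair's viewpoint, is the mug to the left or right of the lamp?",
     ["left", "right"], 0, 6.8),
]

def query_vlm(question, choices):
    """Placeholder for a real VLM call (API or local model); returns a choice index."""
    return 0  # stub: always picks the first option

hits = defaultdict(list)
for category, question, choices, answer, human_rt in items:
    prediction = query_vlm(question, choices)
    hits[category].append(int(prediction == answer))

# Per-category accuracy, as reported in the diagnostic breakdown.
accuracy = {category: mean(h) for category, h in hits.items()}
print(accuracy)

# The paper's difficulty analysis correlates per-item human response time with model
# correctness; with real predictions one could compute, e.g.,
# statistics.correlation(response_times, correctness) over all items.
```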

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SpinBench introduces a cognitively grounded diagnostic benchmark targeting perspective-taking and viewpoint transformation in vision-language models. The taxonomy places this work in a dedicated leaf node under Spatial Reasoning Evaluation and Benchmarking, specifically within Perspective-Taking and Viewpoint Transformation. Notably, this leaf contains only one paper—SpinBench itself—indicating that perspective-taking as a focused evaluation target remains relatively underexplored compared to broader spatial relationship benchmarks or egocentric spatial tasks, which populate neighboring leaves with multiple entries.

The taxonomy reveals that SpinBench occupies a specialized niche within a broader evaluation landscape. Adjacent leaves include General Spatial Relationship and Reasoning Benchmarks (three papers assessing diverse spatial relations), Egocentric and Embodied Spatial Benchmarks (two papers on agent-centric viewpoints), and Occlusion and Pattern-Based Spatial Reasoning (one paper on occluded objects). While these neighboring categories address static spatial relations or agent-specific perspectives, SpinBench's emphasis on viewpoint transformation and rotational understanding distinguishes it from general spatial assessments and egocentric tasks that do not systematically vary observer position.

Among thirty candidates examined through semantic search and citation expansion, none clearly refuted any of SpinBench's three core contributions: the cognitively grounded diagnostic benchmark itself, the controlled variation framework for diagnostic evaluation, and the multi-domain dataset combining real-world and synthetic data. Each contribution was assessed against ten candidates, with zero refutable overlaps identified. This suggests that within the limited search scope, no prior work directly provides the same combination of perspective-taking diagnostics, fine-grained cognitive scaffolding, and multi-domain coverage that SpinBench offers.

The analysis reflects a targeted literature search rather than an exhaustive survey of all spatial reasoning benchmarks. The absence of sibling papers in the same taxonomy leaf and the zero-refutation outcome across thirty candidates indicate that SpinBench addresses a gap in perspective-taking evaluation, though the search scope does not capture every possible related benchmark or dataset. The work's positioning in a sparse leaf suggests it pioneers a specific diagnostic direction within the broader spatial reasoning evaluation landscape.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: spatial reasoning in vision language models. The field has organized itself around several complementary branches that address different facets of how multimodal systems understand and manipulate spatial information. Spatial Representation Enhancement Methods focus on enriching the internal representations that models use to encode geometric and positional cues, while Spatial Reasoning Training and Data Curation emphasizes the creation of specialized datasets and training regimes to improve spatial competence. Spatial Reasoning Mechanisms and Inference Methods explore algorithmic strategies, such as chain-of-thought prompting or intermediate visual reasoning steps, that help models perform multi-step spatial inferences. Spatial Reasoning Evaluation and Benchmarking develops diagnostic tests and metrics to measure capabilities like perspective-taking, viewpoint transformation, and relational understanding, with works such as SpinBench[0] and EmbSpatial-Bench[45] providing targeted assessments. Domain-Specific Spatial Reasoning Applications adapt these techniques to robotics, navigation, and embodied AI, while Multimodal Alignment and Representation Learning and General Vision-Language Model Capabilities and Surveys address broader architectural and alignment questions that underpin spatial understanding.

Recent efforts reveal a tension between specialized spatial modules and end-to-end learned representations. Some lines of work, including SpatialRGPT[1] and SpatialBot[5], introduce explicit spatial encodings or auxiliary reasoning pathways to handle complex geometric queries, whereas others like Visual Cognition[4] and Multimodal LLM Survey[3] examine how general-purpose vision-language models can be fine-tuned or prompted to exhibit spatial competence without heavy architectural changes.

SpinBench[0] sits squarely within the evaluation branch focused on perspective-taking and viewpoint transformation, providing a benchmark that complements broader diagnostic suites like MMIU[26] and aligns closely with works such as Egocentric Spatial Reasoning[8] that probe how models handle observer-relative spatial relations. By systematically testing viewpoint shifts, SpinBench[0] highlights gaps that training-focused approaches like SpatialVLM[15] and data-centric methods aim to address, underscoring the ongoing challenge of building models that robustly generalize across diverse spatial contexts.

Claimed Contributions

SpinBench: A cognitively grounded diagnostic benchmark for spatial reasoning in VLMs

The authors introduce SpinBench, a benchmark designed around perspective taking that decomposes spatial reasoning into fine-grained diagnostic categories including translation, rotation, object relative pose, and viewpoint change. The tasks are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting (a hypothetical sketch of this scaffolding follows below).

Candidate papers retrieved for comparison: 10
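The progression from single-object to multi-object perspective-taking tasks could be represented as a simple ordered structure. The sketch below is hypothetical: the category names mirror the diagnostic axes listed above, but the difficulty attributes and ordering rule are assumptions, not the paper's exact task list.

```python
# A minimal sketch, assuming hypothetical difficulty attributes per category;
# the paper's exact tasks and ordering may differ.
from dataclasses import dataclass

@dataclass(frozen=True)
class DiagnosticCategory:
    name: str
    num_objects: int       # how many objects the task reasons about
    viewpoint_shift: bool  # whether the observer's viewpoint changes

CATEGORIES = [
    DiagnosticCategory("translation", num_objects=1, viewpoint_shift=False),
    DiagnosticCategory("rotation", num_objects=1, viewpoint_shift=False),
    DiagnosticCategory("object_relative_pose", num_objects=2, viewpoint_shift=False),
    DiagnosticCategory("viewpoint_change", num_objects=1, viewpoint_shift=True),
    DiagnosticCategory("perspective_taking", num_objects=2, viewpoint_shift=True),
]

def scaffold_order(categories):
    """Order tasks from simplest (single object, fixed viewpoint) to hardest."""
    return sorted(categories, key=lambda c: (c.viewpoint_shift, c.num_objects))

for category in scaffold_order(CATEGORIES):
    print(category.name)
```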
Controlled variation framework for diagnostic evaluation

The authors develop a systematic framework of controlled variations including reference frame manipulations (allocentric vs. egocentric), premise-based question structures, and syntactic and symmetrical augmentations. This framework enables precise diagnosis of model failures and biases in spatial reasoning; a toy example of such a variation is sketched below.

Candidate papers retrieved for comparison: 10
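As a toy illustration of one such controlled variation: a symmetrical reformulation swaps the two objects and inverts the spatial relation, so a consistent model should give the same yes/no answer to both phrasings. The relation vocabulary and question templates below are assumptions for illustration, not the benchmark's actual templates.

```python
# A toy sketch of the symmetrical/syntactic augmentation idea, assuming simple
# left/right and front/behind relations; the real templates are richer.
OPPOSITE = {
    "to the left of": "to the right of",
    "to the right of": "to the left of",
    "in front of": "behind",
    "behind": "in front of",
}

def question(obj_a, obj_b, relation, frame="egocentric"):
    prefix = ("From your (the camera's) viewpoint"
              if frame == "egocentric"
              else "From a viewer standing on the opposite side of the scene")
    return f"{prefix}, is the {obj_a} {relation} the {obj_b}?"

def symmetric_variant(obj_a, obj_b, relation, frame="egocentric"):
    """Swap the two objects and invert the relation: the ground truth is unchanged."""
    return question(obj_b, obj_a, OPPOSITE[relation], frame)

q_original = question("mug", "laptop", "to the left of")
q_symmetric = symmetric_variant("mug", "laptop", "to the left of")

def consistent(answer_original: bool, answer_symmetric: bool) -> bool:
    """Both phrasings describe the same configuration, so yes/no answers should agree."""
    return answer_original == answer_symmetric

print(q_original)
print(q_symmetric)
```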
Multi-domain dataset combining real-world and synthetic data

The authors construct a benchmark dataset spanning four visual domains (Infinigen synthetic scenes, ABO household objects, cars, and human faces) with 51 tasks totaling 2.7k samples. The dataset ensures domain diversity and real-world relevance while maintaining evaluation rigor through controlled generation and annotation (a hypothetical manifest sketch follows below).

Candidate papers retrieved for comparison: 10
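As a rough illustration of how such a multi-domain dataset might be organized, the sketch below assumes a hypothetical per-sample manifest and checks that every sample falls into one of the four stated domains. The field names, file paths, and example records are illustrative only, not the released schema.

```python
# A minimal sketch, assuming a hypothetical manifest layout for the four domains.
from collections import Counter

DOMAINS = {"infinigen", "abo_objects", "cars", "faces"}

manifest = [
    {"domain": "infinigen", "task": "viewpoint_change_01", "image": "scenes/0001.png", "answer": "B"},
    {"domain": "faces", "task": "rotation_03", "image": "faces/0042.png", "answer": "left"},
    # ... the full benchmark would list 51 tasks and roughly 2.7k samples
]

assert all(sample["domain"] in DOMAINS for sample in manifest), "unknown visual domain"
samples_per_domain = Counter(sample["domain"] for sample in manifest)
num_tasks = len({sample["task"] for sample in manifest})
print(samples_per_domain, num_tasks, len(manifest))
```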

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though the signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SpinBench: A cognitively grounded diagnostic benchmark for spatial reasoning in VLMs

The authors introduce SpinBench, a benchmark designed around perspective taking that decomposes spatial reasoning into fine-grained diagnostic categories including translation, rotation, object relative pose, and viewpoint change. The tasks are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting.

Contribution

Controlled variation framework for diagnostic evaluation

The authors develop a systematic framework of controlled variations including reference frame manipulations (allocentric vs. egocentric), premise-based question structures, and syntactic and symmetrical augmentations. This framework enables precise diagnosis of model failures and biases in spatial reasoning.

Contribution

Multi-domain dataset combining real-world and synthetic data

The authors construct a benchmark dataset spanning four visual domains (Infinigen synthetic scenes, ABO household objects, cars, and human faces) with 51 tasks totaling 2.7k samples. The dataset ensures domain diversity and real-world relevance while maintaining evaluation rigor through controlled generation and annotation.