GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Geometric Reasoning, Benchmarking, Foundation Models
Abstract:

Monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet their true understanding of geometric properties remains unclear. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images and corresponding 3D meshes of diverse polyhedra—including Platonic, Archimedean, Johnson, and Catalan solids, as well as stellations and compound shapes—covering varying levels of complexity and symmetry. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic geometric forms accurately. While foundation models effectively detect specific 3D symmetry elements via non-linear probing, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants exhibit remarkably low accuracy on complex polyhedra, systematically misinterpreting basic properties like face geometry, convexity, and compound structures. GIQ will be publicly available, providing a structured platform to highlight and address critical gaps in geometric intelligence, facilitating future progress in robust, geometry-aware representation learning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GIQ, a benchmark dataset and evaluation framework targeting geometric reasoning in vision and vision-language foundation models. It resides in the '3D Geometric and Structural Understanding' leaf, which contains five papers total, including works on probing 3D awareness, visual attribute benchmarks, and embodied 3D evaluation. This leaf sits within the broader 'Spatial Reasoning Capabilities and Evaluation' branch, indicating a moderately populated research direction focused on diagnostic assessment rather than method development. The taxonomy reveals that geometric reasoning evaluation is an active but not overcrowded area, with distinct clusters for general spatial reasoning, embodied intelligence, and domain-specific tasks.

The taxonomy tree shows neighboring leaves addressing general spatial reasoning benchmarks (six papers on broad relationship understanding) and embodied/robotic spatial intelligence (three papers on egocentric tasks). GIQ's focus on polyhedra, symmetry detection, and mental rotation distinguishes it from these adjacent directions: general spatial benchmarks emphasize 2D relationships and orientation, while embodied benchmarks prioritize navigation and manipulation. The taxonomy's scope notes clarify that 3D geometric property recognition belongs specifically in this leaf, separating it from 2D spatial reasoning or action-oriented evaluation. This structural positioning suggests GIQ addresses a gap between abstract geometric understanding and task-driven spatial intelligence.

Among twenty candidates examined across three contributions, zero refutable pairs were identified. For the first contribution (the GIQ dataset), ten candidates were examined with no clear refutations; for the second (the evaluation framework), no candidates were examined directly; for the third (the empirical findings), ten candidates were examined without refutation. This limited search scope—twenty papers from semantic retrieval—means the analysis captures top-ranked related work but cannot claim exhaustive coverage. The absence of refutations among examined candidates suggests the specific combination of polyhedra-focused tasks and systematic geometric probing may represent a novel evaluation angle, though the small search window leaves room for undetected overlaps.

Based on the top-twenty semantic matches and taxonomy context, GIQ appears to occupy a distinct position within 3D geometric evaluation, emphasizing structural properties and symmetry rather than scene-level reconstruction or embodied tasks. The limited search scope and zero refutations among examined candidates suggest potential novelty, but a broader literature review would be needed to confirm whether similar polyhedra-based benchmarks or mental rotation tests exist outside the retrieved set. The taxonomy structure indicates this work contributes to an active but not saturated research direction.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: benchmarking 3D geometric reasoning of vision foundation models. The field has organized itself around four main branches that reflect different facets of spatial intelligence research. Spatial Reasoning Capabilities and Evaluation focuses on diagnostic benchmarks and probes that measure how well models understand geometric properties, depth relations, and structural configurations, with works like Probing 3D Awareness[1] and AVA-Bench[8] establishing evaluation protocols. Methods for Enhancing Spatial Reasoning explores training strategies and architectural innovations—such as SpatialRGPT[2], SpatialBot[4], and RoboSpatial[5]—that aim to improve models' ability to reason about 3D space through specialized data or learning objectives. Multimodal and 3D Foundation Models examines large-scale pretrained systems that integrate vision, language, and sometimes point clouds or depth, exemplified by efforts like Scaling Spatial Intelligence[3] and RGB-D Transformers[9]. Finally, Specialized Applications and Domains investigates how spatial reasoning transfers to robotics, navigation, geospatial analysis, and other task-specific contexts, highlighting the practical deployment challenges of these capabilities.

A central tension across these branches concerns whether spatial understanding emerges from scale and diverse pretraining or requires targeted geometric supervision and structured representations. Many studies reveal that even powerful vision-language models struggle with fine-grained depth ordering and metric reasoning, prompting a wave of specialized benchmarks like E3D-Bench[19] and CompareBench[11] that isolate particular geometric competencies.

GIQ[0] sits squarely within the 3D Geometric and Structural Understanding cluster, emphasizing rigorous evaluation of how foundation models parse structural properties from images. Compared to neighbors such as Probing 3D Awareness[1], which dissects internal representations, and AVA-Bench[8], which targets broader visual attributes, GIQ[0] zeroes in on geometric intelligence as a distinct dimension, complementing efforts like E3D-Bench[19] that also stress embodied 3D tasks. This positioning underscores an ongoing shift toward decomposing spatial reasoning into measurable sub-skills rather than treating it as a monolithic capability.

Claimed Contributions

GIQ benchmark dataset for evaluating geometric reasoning

The authors present GIQ, a novel benchmark dataset comprising synthetic and real-world images of 224 diverse polyhedra with corresponding 3D meshes. The dataset systematically varies geometric complexity, symmetry properties, and topological regularity to enable rigorous evaluation of spatial reasoning in vision models.

10 retrieved papers
Systematic evaluation framework across four geometric reasoning tasks

The authors develop a comprehensive evaluation framework consisting of four distinct tasks that probe different dimensions of geometric intelligence: explicit 3D reconstruction, implicit symmetry detection via linear and non-linear probing, mental rotation capabilities, and high-level semantic classification by frontier vision-language models.

0 retrieved papers
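For context on the probing protocol this contribution describes: "non-linear probing" conventionally means freezing a foundation model's image embeddings and training a small classifier on top of them to test whether a property (here, a 3D symmetry element) is linearly or non-linearly decodable. The paper's exact probe architecture and feature dimensions are not stated in this report, so the sketch below is illustrative only, using synthetic stand-in features and scikit-learn's `MLPClassifier`; all sizes and the label rule are invented.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 'features' plays the role of frozen
# foundation-model embeddings of polyhedron renderings; 'labels'
# a binary symmetry element (e.g., presence of a mirror plane).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))          # 500 images, 64-d embeddings
labels = (features[:, :8].sum(axis=1) > 0).astype(int)  # synthetic signal

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Non-linear probe: a small MLP trained on top of the frozen features.
# Swapping in LogisticRegression here would give the linear-probe baseline.
probe = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
probe.fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy indicates the property is encoded in the frozen representation; the report's finding is that symmetry elements are decodable this way even when models fail at tasks like mental rotation.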
Empirical findings revealing fundamental gaps in geometric understanding

The authors demonstrate through extensive experiments that current state-of-the-art models exhibit significant limitations in geometric reasoning, including failures in reconstructing simple shapes, struggles with mental rotation tasks requiring fine-grained differentiation, and remarkably low accuracy in interpreting basic shape properties by advanced vision-language assistants.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
