GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Geometric Reasoning, Benchmarking, Foundation Models
Abstract:

Monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet their true understanding of geometric properties remains unclear. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images and corresponding 3D meshes of diverse polyhedra—including Platonic, Archimedean, Johnson, and Catalan solids, as well as stellations and compound shapes—covering varying levels of complexity and symmetry. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic geometric forms accurately. While foundation models effectively detect specific 3D symmetry elements via non-linear probing, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants exhibit remarkably low accuracy on complex polyhedra, systematically misinterpreting basic properties like face geometry, convexity, and compound structures. GIQ will be publicly available, providing a structured platform to highlight and address critical gaps in geometric intelligence, facilitating future progress in robust, geometry-aware representation learning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GIQ, a benchmark dataset and evaluation framework targeting geometric reasoning in vision and vision-language foundation models. It resides in the '3D Geometric and Structural Understanding' leaf, which contains five papers total, including works on probing 3D awareness, visual attribute benchmarks, and embodied 3D evaluation. This leaf sits within the broader 'Spatial Reasoning Capabilities and Evaluation' branch, indicating a moderately populated research direction focused on diagnostic assessment rather than method development. The taxonomy reveals that geometric reasoning evaluation is an active but not overcrowded area, with distinct clusters for general spatial reasoning, embodied intelligence, and domain-specific tasks.

The taxonomy tree shows neighboring leaves addressing general spatial reasoning benchmarks (six papers on broad relationship understanding) and embodied/robotic spatial intelligence (three papers on egocentric tasks). GIQ's focus on polyhedra, symmetry detection, and mental rotation distinguishes it from these adjacent directions: general spatial benchmarks emphasize 2D relationships and orientation, while embodied benchmarks prioritize navigation and manipulation. The taxonomy's scope notes clarify that 3D geometric property recognition belongs specifically in this leaf, separating it from 2D spatial reasoning or action-oriented evaluation. This structural positioning suggests GIQ addresses a gap between abstract geometric understanding and task-driven spatial intelligence.

Among twenty candidates examined across three contributions, zero refutable pairs were identified. For the first contribution (the GIQ dataset), ten candidates were examined with no clear refutations; for the second (the evaluation framework), no candidates were examined directly; for the third (the empirical findings), ten candidates were examined without refutation. This limited search scope—twenty papers from semantic retrieval—means the analysis captures top-ranked related work but cannot claim exhaustive coverage. The absence of refutations among examined candidates suggests the specific combination of polyhedra-focused tasks and systematic geometric probing may represent a novel evaluation angle, though the small search window leaves room for undetected overlaps.

Based on the top-twenty semantic matches and taxonomy context, GIQ appears to occupy a distinct position within 3D geometric evaluation, emphasizing structural properties and symmetry rather than scene-level reconstruction or embodied tasks. The limited search scope and zero refutations among examined candidates suggest potential novelty, but a broader literature review would be needed to confirm whether similar polyhedra-based benchmarks or mental rotation tests exist outside the retrieved set. The taxonomy structure indicates this work contributes to an active but not saturated research direction.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: benchmarking 3D geometric reasoning of vision foundation models. The field has organized itself around four main branches that reflect different facets of spatial intelligence research. Spatial Reasoning Capabilities and Evaluation focuses on diagnostic benchmarks and probes that measure how well models understand geometric properties, depth relations, and structural configurations, with works like Probing 3D Awareness[1] and AVA-Bench[8] establishing evaluation protocols. Methods for Enhancing Spatial Reasoning explores training strategies and architectural innovations—such as SpatialRGPT[2], SpatialBot[4], and RoboSpatial[5]—that aim to improve models' ability to reason about 3D space through specialized data or learning objectives. Multimodal and 3D Foundation Models examines large-scale pretrained systems that integrate vision, language, and sometimes point clouds or depth, exemplified by efforts like Scaling Spatial Intelligence[3] and RGB-D Transformers[9]. Finally, Specialized Applications and Domains investigates how spatial reasoning transfers to robotics, navigation, geospatial analysis, and other task-specific contexts, highlighting the practical deployment challenges of these capabilities.

A central tension across these branches concerns whether spatial understanding emerges from scale and diverse pretraining or requires targeted geometric supervision and structured representations. Many studies reveal that even powerful vision-language models struggle with fine-grained depth ordering and metric reasoning, prompting a wave of specialized benchmarks like E3D-Bench[19] and CompareBench[11] that isolate particular geometric competencies.

GIQ[0] sits squarely within the 3D Geometric and Structural Understanding cluster, emphasizing rigorous evaluation of how foundation models parse structural properties from images. Compared to neighbors such as Probing 3D Awareness[1], which dissects internal representations, and AVA-Bench[8], which targets broader visual attributes, GIQ[0] zeroes in on geometric intelligence as a distinct dimension, complementing efforts like E3D-Bench[19] that also stress embodied 3D tasks. This positioning underscores an ongoing shift toward decomposing spatial reasoning into measurable sub-skills rather than treating it as a monolithic capability.

Claimed Contributions

GIQ benchmark dataset for evaluating geometric reasoning

The authors present GIQ, a novel benchmark dataset comprising synthetic and real-world images of 224 diverse polyhedra with corresponding 3D meshes. The dataset systematically varies geometric complexity, symmetry properties, and topological regularity to enable rigorous evaluation of spatial reasoning in vision models.

10 retrieved papers
Systematic evaluation framework across four geometric reasoning tasks

The authors develop a comprehensive evaluation framework consisting of four distinct tasks that probe different dimensions of geometric intelligence: explicit 3D reconstruction, implicit symmetry detection via linear and non-linear probing, mental rotation capabilities, and high-level semantic classification by frontier vision-language models.

0 retrieved papers
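For context on the probing protocol this contribution describes: "non-linear probing" conventionally means freezing a foundation model's image embeddings and training a small classifier on top of them to test whether a property (here, a 3D symmetry element) is linearly or non-linearly decodable. The paper's exact probe architecture and feature dimensions are not stated in this report, so the sketch below is illustrative only, using synthetic stand-in features and scikit-learn's `MLPClassifier`; all sizes and the label rule are invented.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 'features' plays the role of frozen
# foundation-model embeddings of polyhedron renderings; 'labels'
# a binary symmetry element (e.g., presence of a mirror plane).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))          # 500 images, 64-d embeddings
labels = (features[:, :8].sum(axis=1) > 0).astype(int)  # synthetic signal

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Non-linear probe: a small MLP trained on top of the frozen features.
# Swapping in LogisticRegression here would give the linear-probe baseline.
probe = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
probe.fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy indicates the property is encoded in the frozen representation; the report's finding is that symmetry elements are decodable this way even when models fail at tasks like mental rotation.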
Empirical findings revealing fundamental gaps in geometric understanding

The authors demonstrate through extensive experiments that current state-of-the-art models exhibit significant limitations in geometric reasoning, including failures in reconstructing simple shapes, struggles with mental rotation tasks requiring fine-grained differentiation, and remarkably low accuracy in interpreting basic shape properties by advanced vision-language assistants.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
