Do 3D Large Language Models Really Understand 3D Spatial Relationships?

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: 3D-LLM, 3D spatial reasoning
Abstract:

Recent 3D Large Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet we find that simply fine-tuning a language model on text-only question-answer pairs performs comparably to, or even surpasses, these methods on the SQA3D benchmark without using any 3D input. This indicates that SQA3D may be unable to detect whether a model exploits textual shortcuts rather than engaging in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess distinct aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that leverages negative samples via explicit 3D-relation alignment, substantially improving 3D-LLMs' performance on spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: spatial reasoning in 3D large language models.

The field has rapidly organized itself around several complementary directions. At the foundation, researchers explore how to represent and encode 3D scenes—whether through point clouds, voxel grids, or neural features—so that language models can process geometric information. Building on these representations, a dense branch focuses on spatial reasoning mechanisms and architectures that enable models to understand relations like 'above', 'behind', or 'next to'. Enhancement techniques form another active area, developing methods such as prompting strategies, chain-of-thought reasoning, and memory modules to improve spatial comprehension. Meanwhile, evaluation benchmarks and analysis works provide the diagnostic tools needed to measure progress, including studies of robustness and disambiguation. Application-driven research demonstrates how spatial reasoning powers embodied AI tasks like navigation and manipulation, while surveys and meta-analyses synthesize emerging trends across these branches.

Within the evaluation landscape, a particularly important theme concerns robustness and disambiguation: how well do models handle ambiguous spatial references, viewpoint changes, or adversarial perturbations? 3D LLM Spatial Understanding[0] sits squarely in this analytical cluster, examining how models resolve spatial ambiguities in complex scenes. It shares common ground with Evaluating Spatial Understanding[31], which probes fundamental spatial comprehension, and 3D Spatial Disambiguation[35], which specifically targets reference resolution challenges. These works collectively reveal that while many models excel on clean benchmarks, they often struggle when spatial descriptions become underspecified or when multiple plausible interpretations exist. This line of inquiry complements the broader evaluation efforts—such as those developing comprehensive question-answering datasets—by focusing on the edge cases and failure modes that expose the limits of current spatial reasoning capabilities.

Claimed Contributions

Real-3DQA benchmark for evaluating genuine 3D spatial reasoning

The authors propose Real-3DQA, a new benchmark derived from SQA3D that filters out questions solvable by linguistic shortcuts alone and introduces a viewpoint rotation score to more rigorously evaluate whether 3D-LLMs truly understand spatial relationships rather than exploiting textual patterns.

7 retrieved papers
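To make the filtering step concrete, here is a minimal sketch of one plausible way to drop questions solvable by linguistic shortcuts alone. The paper does not specify its exact procedure; `blind_answer` is a hypothetical text-only baseline (it sees the question string, never the 3D scene), and the majority-vote threshold is our assumption.

```python
# Hedged sketch: filter out "easy-to-guess" questions that a text-only
# baseline answers correctly, keeping only questions that plausibly
# require 3D input. `blind_answer` is a hypothetical callable.

def filter_shortcut_questions(qa_pairs, blind_answer, n_trials=5):
    """Keep only QA pairs the text-only baseline cannot reliably guess.

    qa_pairs: list of (question, gold_answer) tuples
    blind_answer: callable question -> predicted answer (no 3D input)
    A question is dropped if the blind model is right in a strict
    majority of sampled trials.
    """
    kept = []
    for question, gold in qa_pairs:
        hits = sum(blind_answer(question) == gold for _ in range(n_trials))
        if hits <= n_trials // 2:  # shortcut unreliable -> keep question
            kept.append((question, gold))
    return kept
```

In practice the baseline would be a fine-tuned LLM sampled several times per question; a deterministic baseline makes `n_trials` redundant but keeps the sketch general.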
Viewpoint Rotation Score metric for cross-question consistency

The authors introduce the Viewpoint Rotation Score (VRS), a new evaluation metric that tests whether models maintain spatial consistency by measuring their ability to answer logically equivalent questions across different viewpoints (rotations of 90, 180, and 270 degrees).

10 retrieved papers
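The VRS description above can be sketched as a strict group-level consistency score. This is our reading of the metric, not the paper's exact formula: we assume each question group bundles logically equivalent variants at rotations of 0/90/180/270 degrees, and `model_answer` is a hypothetical callable.

```python
# Hedged sketch of a Viewpoint Rotation Score: the fraction of question
# groups answered correctly at *every* rotation. A model that is right
# at 0 degrees but flips its answer after a 90-degree rotation scores 0
# on that group.

def viewpoint_rotation_score(groups, model_answer):
    """Strict cross-viewpoint consistency.

    groups: list of dicts {rotation_deg: (question_text, gold_answer)}
    model_answer: hypothetical callable (question, rotation_deg) -> answer
    """
    if not groups:
        return 0.0
    consistent = sum(
        all(model_answer(q, rot) == gold for rot, (q, gold) in group.items())
        for group in groups
    )
    return consistent / len(groups)
```

The all-or-nothing aggregation is what distinguishes a consistency metric from plain accuracy: averaging per-rotation accuracy would let a model score well while contradicting itself across viewpoints.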
3D-aware reweighted fine-tuning strategy

The authors develop a training strategy called 3D-aware Reweighted Fine-tuning (3DR-FT) that quantifies the 3D dependency of each question and adaptively reweights training samples to encourage models to rely more on genuine 3D spatial information rather than textual shortcuts.

10 retrieved papers
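A minimal sketch of such a reweighted objective, under our own assumptions: the per-sample "3D dependency" score lies in [0, 1] (e.g., the accuracy gap between a 3D-aware model and a text-only one on that question — our illustration, not necessarily the paper's definition), and the weighting scheme `1 + alpha * dependency` with normalization is hypothetical.

```python
# Hedged sketch of a 3D-dependency-reweighted loss: samples that
# genuinely require 3D information are upweighted, so the model gains
# less from fitting text-only shortcuts.

def reweighted_loss(per_sample_losses, dependency_scores, alpha=1.0):
    """Weighted mean loss with weight_i = 1 + alpha * dependency_i.

    per_sample_losses: list of floats (e.g., per-question cross-entropy)
    dependency_scores: list of floats in [0, 1], one per sample
    alpha: strength of the reweighting (alpha=0 recovers the plain mean)
    """
    weights = [1.0 + alpha * d for d in dependency_scores]
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total
```

Normalizing by the weight sum keeps the loss scale comparable to the unweighted mean, so `alpha` can be tuned without also retuning the learning rate.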

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Real-3DQA benchmark for evaluating genuine 3D spatial reasoning

The authors propose Real-3DQA, a new benchmark derived from SQA3D that filters out questions solvable by linguistic shortcuts alone and introduces a viewpoint rotation score to more rigorously evaluate whether 3D-LLMs truly understand spatial relationships rather than exploiting textual patterns.

Contribution

Viewpoint Rotation Score metric for cross-question consistency

The authors introduce the Viewpoint Rotation Score (VRS), a new evaluation metric that tests whether models maintain spatial consistency by measuring their ability to answer logically equivalent questions across different viewpoints (rotations of 90, 180, and 270 degrees).

Contribution

3D-aware reweighted fine-tuning strategy

The authors develop a training strategy called 3D-aware Reweighted Fine-tuning (3DR-FT) that quantifies the 3D dependency of each question and adaptively reweights training samples to encourage models to rely more on genuine 3D spatial information rather than textual shortcuts.