Do 3D Large Language Models Really Understand 3D Spatial Relationships?
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose Real-3DQA, a new benchmark derived from SQA3D that filters out questions solvable by linguistic shortcuts alone and introduces a viewpoint rotation score to more rigorously evaluate whether 3D-LLMs truly understand spatial relationships rather than exploiting textual patterns.
The authors introduce the Viewpoint Rotation Score (VRS), a new evaluation metric that tests whether models maintain spatial consistency by measuring their ability to answer logically equivalent questions across different viewpoints (rotations of 90, 180, and 270 degrees).
The authors develop a training strategy called 3D-aware Reweighted Fine-tuning (3DR-FT) that quantifies the 3D dependency of each question and adaptively reweights training samples to encourage models to rely more on genuine 3D spatial information rather than textual shortcuts.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[31] Evaluating Spatial Understanding of Large Language Models
[35] 3D Spatial Understanding in MLLMs: Disambiguation and Evaluation
Contribution Analysis
Detailed comparisons for each claimed contribution
Real-3DQA benchmark for evaluating genuine 3D spatial reasoning
The authors propose Real-3DQA, a new benchmark derived from SQA3D that filters out questions solvable by linguistic shortcuts alone and introduces a viewpoint rotation score to more rigorously evaluate whether 3D-LLMs truly understand spatial relationships rather than exploiting textual patterns.
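A shortcut-filtering step like the one described can be sketched as follows. This is an illustrative reconstruction, not the authors' actual pipeline: `text_only_model`, the trial count, and the 0.5 accuracy threshold are all assumptions used to show the idea of discarding questions a blind (text-only) baseline already answers correctly.

```python
def filter_shortcut_questions(questions, text_only_model, trials=5):
    """Keep only questions that a text-only baseline fails to answer
    reliably, i.e. questions that presumably require the 3D scene.

    `questions` is a list of dicts with "question" and "answer" keys;
    `text_only_model` maps a question string to a predicted answer.
    (Hypothetical interface for illustration.)
    """
    kept = []
    for q in questions:
        # Query the blind baseline several times; a high hit rate suggests
        # the question is solvable from linguistic patterns alone.
        hits = sum(text_only_model(q["question"]) == q["answer"]
                   for _ in range(trials))
        if hits / trials < 0.5:  # assumed threshold: baseline mostly fails
            kept.append(q)
    return kept
```

In this sketch, a question the blind baseline answers correctly on most trials is treated as a linguistic shortcut and removed; everything else is retained as a candidate for genuine 3D reasoning.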
[18] Surprise3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes
[51] ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes
[52] Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning
[53] Translating Words to Worlds: Zero-Shot Synthesis of 3D Terrain from Textual Descriptions Using Large Language Models
[54] PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model
[55] Imagine in Space: Exploring the Frontier of Spatial Intelligence and Reasoning Efficiency in Vision Language Models
[56] From Narratives to Destinations: Semantic–Spatial Modeling of Tourism Trends Using Geotagged Reviews
Viewpoint Rotation Score metric for cross-question consistency
The authors introduce the Viewpoint Rotation Score (VRS), a new evaluation metric that tests whether models maintain spatial consistency by measuring their ability to answer logically equivalent questions across different viewpoints (rotations of 90, 180, and 270 degrees).
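A consistency metric of this kind can be sketched as below. This is a hedged reconstruction from the description alone: the exact aggregation the authors use is not specified here, so the sketch assumes the strictest variant, where a question counts only if the model is correct at every rotation. The dict-based interface (`(question_id, rotation)` keys) is illustrative.

```python
def viewpoint_rotation_score(model_answers, gold_answers,
                             rotations=(0, 90, 180, 270)):
    """Fraction of questions answered consistently across viewpoints.

    `model_answers` and `gold_answers` map (question_id, rotation_degrees)
    to an answer string; the gold answer at each rotation is the logically
    equivalent answer under that viewpoint (e.g. "left" becomes "right"
    after a 180-degree rotation). Interface is assumed for illustration.
    """
    question_ids = {qid for qid, _ in model_answers}
    consistent = sum(
        all(model_answers[(qid, r)] == gold_answers[(qid, r)]
            for r in rotations)
        for qid in question_ids
    )
    return consistent / len(question_ids)
```

Under this all-or-nothing scoring, a model that pattern-matches on question text scores near chance raised to the number of rotations, while a model with a viewpoint-invariant scene representation keeps its per-question accuracy.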
[67] VistaDream: Sampling Multiview Consistent Images for Single-View Scene Reconstruction
[68] ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
[69] View Invariant Learning for Vision-Language Navigation in Continuous Environments
[70] Multi-view Consistent 3D Panoptic Scene Understanding
[71] Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning
[72] 4D Driving Scene Generation with Stereo Forcing
[73] Multi-level Embedding and Alignment Network with Consistency and Invariance Learning for Cross-view Geo-localization
[74] IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control
[75] Learning Fine-Grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment
[76] View-Invariant Probabilistic Embedding for Human Pose
3D-aware reweighted fine-tuning strategy
The authors develop a training strategy called 3D-aware Reweighted Fine-tuning (3DR-FT) that quantifies the 3D dependency of each question and adaptively reweights training samples to encourage models to rely more on genuine 3D spatial information rather than textual shortcuts.
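The reweighting idea can be sketched as follows. This is a minimal illustration, not the authors' 3DR-FT implementation: it assumes the 3D dependency of a sample is estimated as the accuracy gap between the model with full 3D input and a text-only baseline, and that sample weights are a softmax over those gaps. The accuracy dictionaries and the temperature parameter are hypothetical.

```python
import math

def reweight_samples(samples, blind_acc, full_acc, temperature=1.0):
    """Assign each training sample a weight proportional to its estimated
    3D dependency: the (clamped) gap between accuracy with full 3D input
    and accuracy with text alone, normalized with a softmax.

    `blind_acc` and `full_acc` map a sample id to an accuracy estimate in
    [0, 1]; both are assumed to be precomputed. Returns weights summing to 1.
    """
    # Samples solvable from text alone (small or negative gap) get low weight.
    deps = [max(full_acc[s] - blind_acc[s], 0.0) for s in samples]
    exps = [math.exp(d / temperature) for d in deps]
    z = sum(exps)
    return {s: e / z for s, e in zip(samples, exps)}
```

Fine-tuning with these weights scales each sample's loss term, so gradient updates are dominated by questions that genuinely require the 3D scene rather than by textual-shortcut questions.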