Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Orientation Understanding, 3D Scene Understanding, MLLM Probing, Benchmark Dataset, Computer Vision
Abstract:

Object orientation understanding is a fundamental challenge in visual perception that underpins critical real-world applications such as robotic manipulation and augmented reality. However, current vision-language benchmarks fail to isolate and evaluate this core capability, often conflating it with positional relationships (such as above/below or proximity between objects) and general scene understanding. To address this, we introduce Discriminative Orientation Reasoning Intelligence (DORI), a comprehensive hierarchical benchmark that establishes object orientation perception as a primary evaluation target. DORI rigorously assesses four essential dimensions of object orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. DORI provides valuable insights into how existing multimodal systems process and understand object orientations through carefully curated tasks from 14 sources that span 67 object categories across synthetic and real-world scenarios. Our evaluation of 22 state-of-the-art vision-language models using DORI reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 45.0% on granular orientation judgments, with performance deteriorating substantially for tasks requiring reference-frame shifts or compound rotations. These findings demonstrate the urgent need for dedicated orientation representation mechanisms in future architectures, as models show a systematic inability to perform precise angular estimation, track orientation changes across multiple viewpoints, and understand compound rotations, suggesting fundamental limitations in their internal 3D spatial representations.
As the first diagnostic framework specifically designed to advance orientation awareness in multimodal systems, DORI has immediate implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DORI, a benchmark isolating object orientation perception across four dimensions: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. According to the taxonomy tree, this work resides in the 'Orientation-Specific Evaluation' leaf under 'Spatial Reasoning Foundations and Benchmarking'. Notably, this leaf contains only one paper—the present work—indicating that orientation-specific evaluation represents a sparse, underexplored research direction. In contrast, sibling leaves like 'General Spatial Reasoning Evaluation' contain four papers, suggesting that prior benchmarking efforts have focused on broader spatial capabilities rather than isolating orientation as a distinct perceptual challenge.

The taxonomy tree reveals that neighboring leaves address related but distinct aspects of spatial understanding. 'Egocentric and Multi-View Spatial Assessment' (three papers) evaluates perspective-dependent reasoning, while '3D Spatial Understanding Benchmarks' (two papers) focuses on depth and volumetric properties. The 'Large-Scale Spatial Reasoning Datasets' leaf (one paper) provides million-scale training data across diverse instruction formats. DORI's scope_note explicitly excludes general spatial reasoning and positional relationships, positioning it as a complementary evaluation tool that targets a capability often conflated with broader scene understanding in existing benchmarks like SpatialRGPT and MM-Spatial.

Among the 30 candidates examined through semantic search and citation expansion, none clearly refute the three core contributions. For the DORI benchmark itself, 10 candidates were examined with zero refutable overlaps. Similarly, the two-tiered assessment framework and the structured prompt design methodology were each compared against 10 candidates, with no clear prior work providing the same evaluation structure. This suggests that within the limited search scope, the specific combination of orientation-focused tasks, hierarchical question design, and isolation from positional reasoning appears novel. However, the small search scale (30 candidates total) means the analysis cannot rule out relevant prior work outside the top semantic matches.

Given the limited literature search scope and the paper's position as the sole occupant of its taxonomy leaf, the work appears to address a genuine gap in orientation-specific evaluation. The absence of sibling papers and the explicit exclude_note distinguishing orientation from general spatial reasoning reinforce this impression. However, the analysis is constrained by examining only 30 candidates, and the broader 'Spatial Reasoning Foundations and Benchmarking' branch contains 16 papers that may touch on orientation indirectly. A more exhaustive search could reveal overlapping evaluation dimensions in general spatial benchmarks.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Object orientation understanding in multimodal language models. The field has organized itself around several complementary branches that address different facets of spatial reasoning in vision-language systems. Spatial Reasoning Foundations and Benchmarking establishes evaluation protocols and datasets to measure how well models grasp directional relations, object poses, and viewpoint-dependent properties, with works like SpatialRGPT[1] and MM-Spatial[11] providing comprehensive testbeds. Architectural Enhancements for Spatial Understanding explores model designs that explicitly encode geometric information, often through specialized tokens or coordinate representations, as seen in SpatialVLM[19] and SpatialBot[4]. Domain-Specific Spatial Applications tailors spatial reasoning to particular contexts such as robotics, navigation, and autonomous driving, exemplified by NavGPT[7] and DriveVLM[9]. Spatiotemporal and Multi-Image Reasoning extends spatial understanding across time or multiple viewpoints, while Specialized Spatial Reasoning Tasks targets narrower challenges like depth estimation or layout analysis, and 3D Object Detection with Vision-Language Models bridges traditional detection pipelines with language-grounded spatial queries.

A central tension across these branches concerns whether to rely on pretrained vision-language backbones with minimal modification or to inject explicit spatial inductive biases through architectural changes and specialized training data. Many studies in the benchmarking branch reveal that even state-of-the-art models struggle with fine-grained orientation judgments, prompting works in the architectural branch to propose coordinate-aware encoders or region-level grounding mechanisms.

Right Side Up[0] sits within the Orientation-Specific Evaluation cluster of the benchmarking branch, focusing specifically on how models interpret object rotations and canonical orientations. Its emphasis contrasts with broader spatial benchmarks like SpatialRGPT[1], which cover a wider range of relational reasoning tasks, and complements egocentric perspective studies such as Egocentric Orientation[5], which examine viewpoint-dependent spatial understanding. By isolating orientation as a distinct capability, this work highlights a subtle but critical gap in current multimodal models' perceptual repertoire.

Claimed Contributions

DORI benchmark for object orientation understanding

The authors propose DORI, a new benchmark designed to evaluate multimodal language models on object orientation understanding. It decomposes this capability into four fundamental dimensions: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding, using 33,656 multiple-choice questions from 14 sources.

10 retrieved papers
Two-tiered assessment framework with coarse and fine-grained questions

The authors introduce a hierarchical evaluation approach that includes both coarse-grained questions for basic categorical judgments and fine-grained questions for precise angular measurements. This enables systematic evaluation from fundamental perception to advanced orientation reasoning.

10 retrieved papers
Structured prompt design methodology for isolating orientation perception

The authors develop a systematic three-step process for creating evaluation prompts that isolate orientation perception from confounding factors such as object recognition difficulty, scene clutter, and linguistic ambiguity. The prompts follow a structured format with five key components including task description, contextual information, step-by-step instructions, multiple-choice options, and concrete examples.
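As an illustrative sketch of the five-component structure described above, the components can be assembled into a single evaluation prompt as follows. All wording, names, and question content below are hypothetical examples invented for illustration; they are not taken from the DORI benchmark itself.

```python
# Hypothetical template illustrating the five-component prompt structure
# described in the paper: task description, contextual information,
# step-by-step instructions, multiple-choice options, and a concrete example.
PROMPT_TEMPLATE = """\
Task: {task_description}
Context: {context}
Instructions:
{instructions}
Options:
{options}
Example: {example}
Answer with a single option letter."""

def build_prompt(task_description, context, instructions, options, example):
    """Assemble the five components into one evaluation prompt string."""
    return PROMPT_TEMPLATE.format(
        task_description=task_description,
        context=context,
        instructions="\n".join(f"{i + 1}. {step}" for i, step in enumerate(instructions)),
        options="\n".join(f"{letter}. {text}" for letter, text in options),
        example=example,
    )

# Invented example question (not from the benchmark).
prompt = build_prompt(
    task_description="Judge the rotation applied to the object between the two views.",
    context="Both images show the same chair against a plain background.",
    instructions=[
        "Locate the chair's front face in the first image.",
        "Locate the same face in the second image.",
        "Estimate the rotation about the vertical axis.",
    ],
    options=[("A", "0 degrees"), ("B", "90 degrees"),
             ("C", "180 degrees"), ("D", "270 degrees")],
    example="If the front face turns from camera-facing to left-facing, answer B.",
)
print(prompt)
```

A fixed template of this kind keeps wording uniform across items, so that accuracy differences reflect orientation perception rather than variation in prompt phrasing.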

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DORI benchmark for object orientation understanding

The authors propose DORI, a new benchmark designed to evaluate multimodal language models on object orientation understanding. It decomposes this capability into four fundamental dimensions: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding, using 33,656 multiple-choice questions from 14 sources.

Contribution

Two-tiered assessment framework with coarse and fine-grained questions

The authors introduce a hierarchical evaluation approach that includes both coarse-grained questions for basic categorical judgments and fine-grained questions for precise angular measurements. This enables systematic evaluation from fundamental perception to advanced orientation reasoning.

Contribution

Structured prompt design methodology for isolating orientation perception

The authors develop a systematic three-step process for creating evaluation prompts that isolate orientation perception from confounding factors such as object recognition difficulty, scene clutter, and linguistic ambiguity. The prompts follow a structured format with five key components including task description, contextual information, step-by-step instructions, multiple-choice options, and concrete examples.