Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Orientation Understanding, 3D Scene Understanding, MLLM Probing, Benchmark Dataset, Computer Vision
Abstract:

Object orientation understanding is a fundamental challenge in visual perception that underpins critical real-world applications such as robotic manipulation and augmented reality. However, current vision-language benchmarks fail to isolate and evaluate this core capability, often conflating it with positional relationships (such as above/below or proximity between objects) and general scene understanding. To address this, we introduce Discriminative Orientation Reasoning Intelligence (DORI), a comprehensive hierarchical benchmark that establishes object orientation perception as a primary evaluation target. DORI rigorously assesses four essential dimensions of object orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. DORI provides valuable insights into how existing multimodal systems process and understand object orientations through carefully curated tasks from 14 sources that span 67 object categories across synthetic and real-world scenarios. Our evaluation of 22 state-of-the-art vision-language models using DORI reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 45.0% on granular orientation judgments, with performance deteriorating substantially for tasks requiring reference-frame shifts or compound rotations. These findings demonstrate the urgent need for dedicated orientation representation mechanisms in future architectures, as models show a systematic inability to perform precise angular estimation, track orientation changes across multiple viewpoints, and understand compound rotations, suggesting fundamental limitations in their internal 3D spatial representations.
As the first diagnostic framework specifically designed to advance orientation awareness in multimodal systems, DORI has immediate implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DORI, a benchmark isolating object orientation perception across four dimensions: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. According to the taxonomy tree, this work resides in the 'Orientation-Specific Evaluation' leaf under 'Spatial Reasoning Foundations and Benchmarking'. Notably, this leaf contains only one paper—the present work—indicating that orientation-specific evaluation represents a sparse, underexplored research direction. In contrast, sibling leaves like 'General Spatial Reasoning Evaluation' contain four papers, suggesting that prior benchmarking efforts have focused on broader spatial capabilities rather than isolating orientation as a distinct perceptual challenge.

The taxonomy tree reveals that neighboring leaves address related but distinct aspects of spatial understanding. 'Egocentric and Multi-View Spatial Assessment' (three papers) evaluates perspective-dependent reasoning, while '3D Spatial Understanding Benchmarks' (two papers) focuses on depth and volumetric properties. The 'Large-Scale Spatial Reasoning Datasets' leaf (one paper) provides million-scale training data across diverse instruction formats. DORI's scope_note explicitly excludes general spatial reasoning and positional relationships, positioning it as a complementary evaluation tool that targets a capability often conflated with broader scene understanding in existing benchmarks like SpatialRGPT and MM-Spatial.

Among the 30 candidates examined through semantic search and citation expansion, none clearly refute the three core contributions. For the DORI benchmark itself, 10 candidates were examined with zero refutable overlaps. Similarly, the two-tiered assessment framework and the structured prompt design methodology were each compared against 10 candidates, with no clear prior work providing the same evaluation structure. This suggests that within the limited search scope, the specific combination of orientation-focused tasks, hierarchical question design, and isolation from positional reasoning appears novel. However, the small search scale (30 candidates total) means the analysis cannot rule out relevant prior work outside the top semantic matches.

Given the limited literature search scope and the paper's position as the sole occupant of its taxonomy leaf, the work appears to address a genuine gap in orientation-specific evaluation. The absence of sibling papers and the explicit exclude_note distinguishing orientation from general spatial reasoning reinforce this impression. However, the analysis is constrained by examining only 30 candidates, and the broader 'Spatial Reasoning Foundations and Benchmarking' branch contains 16 papers that may touch on orientation indirectly. A more exhaustive search could reveal overlapping evaluation dimensions in general spatial benchmarks.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Object orientation understanding in multimodal language models. The field has organized itself around several complementary branches that address different facets of spatial reasoning in vision-language systems. Spatial Reasoning Foundations and Benchmarking establishes evaluation protocols and datasets to measure how well models grasp directional relations, object poses, and viewpoint-dependent properties, with works like SpatialRGPT[1] and MM-Spatial[11] providing comprehensive testbeds. Architectural Enhancements for Spatial Understanding explores model designs that explicitly encode geometric information, often through specialized tokens or coordinate representations, as seen in SpatialVLM[19] and SpatialBot[4]. Domain-Specific Spatial Applications tailors spatial reasoning to particular contexts such as robotics, navigation, and autonomous driving, exemplified by NavGPT[7] and DriveVLM[9]. Spatiotemporal and Multi-Image Reasoning extends spatial understanding across time or multiple viewpoints, while Specialized Spatial Reasoning Tasks targets narrower challenges like depth estimation or layout analysis, and 3D Object Detection with Vision-Language Models bridges traditional detection pipelines with language-grounded spatial queries.

A central tension across these branches concerns whether to rely on pretrained vision-language backbones with minimal modification or to inject explicit spatial inductive biases through architectural changes and specialized training data. Many studies in the benchmarking branch reveal that even state-of-the-art models struggle with fine-grained orientation judgments, prompting works in the architectural branch to propose coordinate-aware encoders or region-level grounding mechanisms.

Right Side Up[0] sits within the Orientation-Specific Evaluation cluster of the benchmarking branch, focusing specifically on how models interpret object rotations and canonical orientations. Its emphasis contrasts with broader spatial benchmarks like SpatialRGPT[1], which cover a wider range of relational reasoning tasks, and complements egocentric perspective studies such as Egocentric Orientation[5], which examine viewpoint-dependent spatial understanding. By isolating orientation as a distinct capability, this work highlights a subtle but critical gap in current multimodal models' perceptual repertoire.

Claimed Contributions

DORI benchmark for object orientation understanding

The authors propose DORI, a new benchmark designed to evaluate multimodal language models on object orientation understanding. It decomposes this capability into four fundamental dimensions: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding, using 33,656 multiple-choice questions from 14 sources.

10 retrieved papers
Two-tiered assessment framework with coarse and fine-grained questions

The authors introduce a hierarchical evaluation approach that includes both coarse-grained questions for basic categorical judgments and fine-grained questions for precise angular measurements. This enables systematic evaluation from fundamental perception to advanced orientation reasoning.

10 retrieved papers
Structured prompt design methodology for isolating orientation perception

The authors develop a systematic three-step process for creating evaluation prompts that isolate orientation perception from confounding factors such as object recognition difficulty, scene clutter, and linguistic ambiguity. The prompts follow a structured format with five key components including task description, contextual information, step-by-step instructions, multiple-choice options, and concrete examples.
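As an illustrative sketch of the five-component structure described above, the components can be assembled into a single evaluation prompt as follows. All wording, names, and question content below are hypothetical examples invented for illustration; they are not taken from the DORI benchmark itself.

```python
# Hypothetical template illustrating the five-component prompt structure
# described in the paper: task description, contextual information,
# step-by-step instructions, multiple-choice options, and a concrete example.
PROMPT_TEMPLATE = """\
Task: {task_description}
Context: {context}
Instructions:
{instructions}
Options:
{options}
Example: {example}
Answer with a single option letter."""

def build_prompt(task_description, context, instructions, options, example):
    """Assemble the five components into one evaluation prompt string."""
    return PROMPT_TEMPLATE.format(
        task_description=task_description,
        context=context,
        instructions="\n".join(f"{i + 1}. {step}" for i, step in enumerate(instructions)),
        options="\n".join(f"{letter}. {text}" for letter, text in options),
        example=example,
    )

# Invented example question (not from the benchmark).
prompt = build_prompt(
    task_description="Judge the rotation applied to the object between the two views.",
    context="Both images show the same chair against a plain background.",
    instructions=[
        "Locate the chair's front face in the first image.",
        "Locate the same face in the second image.",
        "Estimate the rotation about the vertical axis.",
    ],
    options=[("A", "0 degrees"), ("B", "90 degrees"),
             ("C", "180 degrees"), ("D", "270 degrees")],
    example="If the front face turns from camera-facing to left-facing, answer B.",
)
print(prompt)
```

A fixed template of this kind keeps wording uniform across items, so that accuracy differences reflect orientation perception rather than variation in prompt phrasing.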

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DORI benchmark for object orientation understanding

The authors propose DORI, a new benchmark designed to evaluate multimodal language models on object orientation understanding. It decomposes this capability into four fundamental dimensions: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding, using 33,656 multiple-choice questions from 14 sources.

Contribution

Two-tiered assessment framework with coarse and fine-grained questions

The authors introduce a hierarchical evaluation approach that includes both coarse-grained questions for basic categorical judgments and fine-grained questions for precise angular measurements. This enables systematic evaluation from fundamental perception to advanced orientation reasoning.

Contribution

Structured prompt design methodology for isolating orientation perception

The authors develop a systematic three-step process for creating evaluation prompts that isolate orientation perception from confounding factors such as object recognition difficulty, scene clutter, and linguistic ambiguity. The prompts follow a structured format with five key components including task description, contextual information, step-by-step instructions, multiple-choice options, and concrete examples.