DepthLM: Metric Depth from Vision Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Metric depth, Vision language model
Abstract:

Vision language models (VLMs) can flexibly address a wide range of vision tasks through text interactions. Although successful in semantic understanding, state-of-the-art VLMs, including GPT-5, still struggle to infer 3D structure from 2D inputs. By contrast, expert pure-vision models achieve super-human accuracy in metric depth estimation, a key 3D understanding task, but they require task-specific architectures and losses. This gap motivates us to ask: can VLMs reach expert-level accuracy without any architecture or loss change? Taking per-pixel metric depth estimation as the representative task, we show that the answer is yes. Surprisingly, comprehensive analysis shows that text-based supervised fine-tuning with sparse labels is sufficient for VLMs to unlock strong 3D understanding; no dense prediction head or complex regression/regularization loss is needed. The bottleneck lies in pixel referencing and cross-dataset camera ambiguity, which we address through visual prompting and intrinsic-conditioned augmentation. With much smaller models, our method DepthLM surpasses the accuracy of the most advanced VLMs by over 2x, making VLMs comparable with pure-vision models for the first time. Meanwhile, this simplicity makes DepthLM scalable to more complex 3D tasks within a unified model. Code will be released to the community.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: metric depth estimation from vision language models. The field has evolved from traditional monocular depth methods toward integrating language and multimodal reasoning with depth perception. The taxonomy reveals several major branches: some focus on VLM architectures and training strategies that embed depth understanding into large-scale vision-language models (e.g., SpatialRGPT[1], SpatialBot[2]), while others explore language-guided depth estimation from contrastive models like CLIP (Prompt CLIP Depth[17], WorDepth[18]) or diffusion-based approaches (PriorDiffusion[34]). Additional branches address LLM-based depth reasoning (LLM Depth Understanding[15], Language Understand Depth[32]), depth-conditioned planning for robotics (QDepth VLA[5], RoboRefer[6]), and benchmarking spatial understanding in VLMs (MM Spatial[19]). Unified multi-task models (Unified IO[7], Florence VL[23]) and specialized depth applications (Underwater Metric Depth[40], Adverse Weather Depth[38]) round out the landscape, alongside survey literature and methods for relative-to-metric depth conversion.

Recent work has concentrated on two contrasting themes: end-to-end VLM architectures that jointly learn language and depth representations versus modular pipelines that leverage pretrained language models to guide or interpret depth outputs. DepthLM[0] sits squarely in the core metric depth estimation branch, emphasizing direct prediction of metric depth from VLMs without relying solely on contrastive or diffusion priors. This positions it closely alongside SpatialRGPT[1] and SpatialBot[2], which similarly integrate spatial reasoning into large vision-language frameworks, though those works often prioritize broader spatial understanding tasks beyond pure depth estimation.

Compared to language-guided contrastive methods (WorDepth[18]) or diffusion-based depth generation (PriorDiffusion[34]), DepthLM[0] appears more focused on producing accurate metric depth maps as a primary output rather than as an auxiliary signal for downstream reasoning or generation. Open questions remain around how best to fuse language semantics with geometric cues, and whether unified architectures or specialized depth modules offer better trade-offs in accuracy, generalization, and computational efficiency.

Claimed Contributions

DepthLM framework for metric depth estimation in VLMs

The authors introduce DepthLM, a framework that enables vision language models to achieve expert-level accuracy in pixel-level metric depth estimation. The method uses visual prompting with rendered markers for pixel reference and intrinsic-conditioned augmentation to resolve camera ambiguity, requiring no architectural modifications or specialized loss functions.

10 retrieved papers
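To make the claimed mechanism concrete, the sketch below illustrates one plausible reading of intrinsic-conditioned augmentation, assuming the standard pinhole model: a random crop-and-resize changes the effective focal length and principal point, and the updated intrinsics are surfaced in a text prompt so depth scale is disambiguated across datasets. All function names and the prompt wording are hypothetical; this is not the authors' implementation.

```python
import random

def intrinsic_conditioned_crop(image_w, image_h, fx, fy, cx, cy,
                               target_w, target_h, rng=random):
    """Randomly crop-and-resize an image region and return the updated
    pinhole intrinsics (hypothetical sketch of intrinsic-conditioned
    augmentation; the real recipe may differ)."""
    # Pick a random crop no smaller than half the image in each dimension.
    crop_w = rng.randint(image_w // 2, image_w)
    crop_h = rng.randint(image_h // 2, image_h)
    x0 = rng.randint(0, image_w - crop_w)
    y0 = rng.randint(0, image_h - crop_h)

    # Cropping shifts the principal point; resizing scales focal lengths.
    sx = target_w / crop_w
    sy = target_h / crop_h
    new_fx, new_fy = fx * sx, fy * sy
    new_cx, new_cy = (cx - x0) * sx, (cy - y0) * sy
    return (x0, y0, crop_w, crop_h), (new_fx, new_fy, new_cx, new_cy)

def depth_prompt(u, v, fx):
    """Hypothetical text prompt pairing a rendered pixel marker at (u, v)
    with the effective focal length of the augmented image."""
    return (f"The image has focal length {fx:.0f} px. "
            f"What is the metric depth (in meters) at the marked pixel "
            f"({u}, {v})? Answer with a number.")
```

The key point of the augmentation is that two images showing the same scene at different zoom levels become distinguishable to the model only because the stated focal length changes with the crop.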
DepthLMBench benchmark suite

The authors create DepthLMBench, a curated mixture of public datasets (approximately 16M training images from 7 datasets) that enables training and evaluation of VLMs for 3D understanding tasks, allowing direct comparison with pure vision models on metric depth estimation.

10 retrieved papers
Unified VLM for diverse 3D understanding tasks

The authors demonstrate that DepthLM can be extended to train a unified model handling multiple 3D understanding tasks (including principal axis distance, speed estimation, time estimation, two-point distance, and camera pose estimation) using the same architecture and training framework, without requiring task-specific designs.

10 retrieved papers
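Since all of these tasks share the same text-in, text-out interface, the unified setup can be pictured as a set of per-task prompt templates plus a common numeric-answer parser. The sketch below is an assumption about how such an interface might look; the task names follow this report, but the exact prompt wording and parsing are hypothetical.

```python
import re

# Hypothetical prompt templates for the unified multi-task setup;
# the exact wording used by DepthLM is not specified in this report.
TASK_PROMPTS = {
    "depth": "What is the metric depth (m) at the marked pixel?",
    "two_point_distance": "What is the 3D distance (m) between the two marked pixels?",
    "speed": "What is the camera's speed (m/s) between these two frames?",
    "time": "How much time (s) elapsed between these two frames?",
}

def parse_numeric_answer(text):
    """Extract the first number from a model's free-form text answer,
    so every task can be supervised with plain text labels."""
    m = re.search(r"[-+]?\d+(?:\.\d+)?", text)
    if m is None:
        raise ValueError(f"no numeric answer in: {text!r}")
    return float(m.group())
```

The appeal of this design, as claimed, is that adding a new 3D task only requires a new template and labels, not a new head or loss.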

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DepthLM framework for metric depth estimation in VLMs

The authors introduce DepthLM, a framework that enables vision language models to achieve expert-level accuracy in pixel-level metric depth estimation. The method uses visual prompting with rendered markers for pixel reference and intrinsic-conditioned augmentation to resolve camera ambiguity, requiring no architectural modifications or specialized loss functions.

Contribution

DepthLMBench benchmark suite

The authors create DepthLMBench, a curated mixture of public datasets (approximately 16M training images from 7 datasets) that enables training and evaluation of VLMs for 3D understanding tasks, allowing direct comparison with pure vision models on metric depth estimation.

Contribution

Unified VLM for diverse 3D understanding tasks

The authors demonstrate that DepthLM can be extended to train a unified model handling multiple 3D understanding tasks (including principal axis distance, speed estimation, time estimation, two-point distance, and camera pose estimation) using the same architecture and training framework, without requiring task-specific designs.