DepthLM: Metric Depth from Vision Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Metric depth, Vision language model
Abstract:

Vision language models (VLMs) can flexibly address a wide range of vision tasks through text interactions. Although successful in semantic understanding, state-of-the-art VLMs, including GPT-5, still struggle to infer 3D structure from 2D inputs. By contrast, expert pure-vision models achieve super-human accuracy in metric depth estimation, a key 3D understanding task, but they require task-specific architectures and losses. This gap motivates us to ask: can VLMs reach expert-level accuracy without any architecture or loss change? Taking per-pixel metric depth estimation as the representative task, we show that the answer is yes. Surprisingly, comprehensive analysis shows that text-based supervised fine-tuning with sparse labels is sufficient for VLMs to unlock strong 3D understanding; no dense prediction head or complex regression/regularization loss is needed. The bottleneck lies in pixel referencing and cross-dataset camera ambiguity, which we address through visual prompting and intrinsic-conditioned augmentation. With much smaller models, our method DepthLM surpasses the accuracy of the most advanced VLMs by over 2x, making VLMs comparable with pure-vision models for the first time. Meanwhile, this simplicity makes DepthLM scalable to more complex 3D tasks within a unified model. Code will be released to the community.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: metric depth estimation from vision language models. The field has evolved from traditional monocular depth methods toward integrating language and multimodal reasoning with depth perception. The taxonomy reveals several major branches: some focus on VLM architectures and training strategies that embed depth understanding into large-scale vision-language models (e.g., SpatialRGPT[1], SpatialBot[2]), while others explore language-guided depth estimation from contrastive models like CLIP (Prompt CLIP Depth[17], WorDepth[18]) or diffusion-based approaches (PriorDiffusion[34]). Additional branches address LLM-based depth reasoning (LLM Depth Understanding[15], Language Understand Depth[32]), depth-conditioned planning for robotics (QDepth VLA[5], RoboRefer[6]), and benchmarking spatial understanding in VLMs (MM Spatial[19]). Unified multi-task models (Unified IO[7], Florence VL[23]) and specialized depth applications (Underwater Metric Depth[40], Adverse Weather Depth[38]) round out the landscape, alongside survey literature and methods for relative-to-metric depth conversion.

Recent work has concentrated on two contrasting themes: end-to-end VLM architectures that jointly learn language and depth representations versus modular pipelines that leverage pretrained language models to guide or interpret depth outputs. DepthLM[0] sits squarely in the core metric depth estimation branch, emphasizing direct prediction of metric depth from VLMs without relying solely on contrastive or diffusion priors. This positions it closely alongside SpatialRGPT[1] and SpatialBot[2], which similarly integrate spatial reasoning into large vision-language frameworks, though those works often prioritize broader spatial understanding tasks beyond pure depth estimation.

Compared to language-guided contrastive methods (WorDepth[18]) or diffusion-based depth generation (PriorDiffusion[34]), DepthLM[0] appears more focused on producing accurate metric depth maps as a primary output rather than as an auxiliary signal for downstream reasoning or generation. Open questions remain around how best to fuse language semantics with geometric cues, and whether unified architectures or specialized depth modules offer better trade-offs in accuracy, generalization, and computational efficiency.

Claimed Contributions

DepthLM framework for metric depth estimation in VLMs

The authors introduce DepthLM, a framework that enables vision language models to achieve expert-level accuracy in pixel-level metric depth estimation. The method uses visual prompting with rendered markers for pixel reference and intrinsic-conditioned augmentation to resolve camera ambiguity, requiring no architectural modifications or specialized loss functions.

10 retrieved papers
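To make the claimed mechanism concrete, the sketch below illustrates one plausible reading of intrinsic-conditioned augmentation, assuming the standard pinhole model: a random crop-and-resize changes the effective focal length and principal point, and the updated intrinsics are surfaced in a text prompt so depth scale is disambiguated across datasets. All function names and the prompt wording are hypothetical; this is not the authors' implementation.

```python
import random

def intrinsic_conditioned_crop(image_w, image_h, fx, fy, cx, cy,
                               target_w, target_h, rng=random):
    """Randomly crop-and-resize an image region and return the updated
    pinhole intrinsics (hypothetical sketch of intrinsic-conditioned
    augmentation; the real recipe may differ)."""
    # Pick a random crop no smaller than half the image in each dimension.
    crop_w = rng.randint(image_w // 2, image_w)
    crop_h = rng.randint(image_h // 2, image_h)
    x0 = rng.randint(0, image_w - crop_w)
    y0 = rng.randint(0, image_h - crop_h)

    # Cropping shifts the principal point; resizing scales focal lengths.
    sx = target_w / crop_w
    sy = target_h / crop_h
    new_fx, new_fy = fx * sx, fy * sy
    new_cx, new_cy = (cx - x0) * sx, (cy - y0) * sy
    return (x0, y0, crop_w, crop_h), (new_fx, new_fy, new_cx, new_cy)

def depth_prompt(u, v, fx):
    """Hypothetical text prompt pairing a rendered pixel marker at (u, v)
    with the effective focal length of the augmented image."""
    return (f"The image has focal length {fx:.0f} px. "
            f"What is the metric depth (in meters) at the marked pixel "
            f"({u}, {v})? Answer with a number.")
```

The key point of the augmentation is that two images showing the same scene at different zoom levels become distinguishable to the model only because the stated focal length changes with the crop.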
DepthLMBench benchmark suite

The authors create DepthLMBench, a curated mixture of public datasets (approximately 16M training images from 7 datasets) that enables training and evaluation of VLMs for 3D understanding tasks, allowing direct comparison with pure vision models on metric depth estimation.

10 retrieved papers
Unified VLM for diverse 3D understanding tasks

The authors demonstrate that DepthLM can be extended to train a unified model handling multiple 3D understanding tasks (including principal axis distance, speed estimation, time estimation, two-point distance, and camera pose estimation) using the same architecture and training framework, without requiring task-specific designs.

10 retrieved papers
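Since all of these tasks share the same text-in, text-out interface, the unified setup can be pictured as a set of per-task prompt templates plus a common numeric-answer parser. The sketch below is an assumption about how such an interface might look; the task names follow this report, but the exact prompt wording and parsing are hypothetical.

```python
import re

# Hypothetical prompt templates for the unified multi-task setup;
# the exact wording used by DepthLM is not specified in this report.
TASK_PROMPTS = {
    "depth": "What is the metric depth (m) at the marked pixel?",
    "two_point_distance": "What is the 3D distance (m) between the two marked pixels?",
    "speed": "What is the camera's speed (m/s) between these two frames?",
    "time": "How much time (s) elapsed between these two frames?",
}

def parse_numeric_answer(text):
    """Extract the first number from a model's free-form text answer,
    so every task can be supervised with plain text labels."""
    m = re.search(r"[-+]?\d+(?:\.\d+)?", text)
    if m is None:
        raise ValueError(f"no numeric answer in: {text!r}")
    return float(m.group())
```

The appeal of this design, as claimed, is that adding a new 3D task only requires a new template and labels, not a new head or loss.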

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DepthLM framework for metric depth estimation in VLMs

The authors introduce DepthLM, a framework that enables vision language models to achieve expert-level accuracy in pixel-level metric depth estimation. The method uses visual prompting with rendered markers for pixel reference and intrinsic-conditioned augmentation to resolve camera ambiguity, requiring no architectural modifications or specialized loss functions.

Contribution

DepthLMBench benchmark suite

The authors create DepthLMBench, a curated mixture of public datasets (approximately 16M training images from 7 datasets) that enables training and evaluation of VLMs for 3D understanding tasks, allowing direct comparison with pure vision models on metric depth estimation.

Contribution

Unified VLM for diverse 3D understanding tasks

The authors demonstrate that DepthLM can be extended to train a unified model handling multiple 3D understanding tasks (including principal axis distance, speed estimation, time estimation, two-point distance, and camera pose estimation) using the same architecture and training framework, without requiring task-specific designs.