DepthLM: Metric Depth from Vision Language Models
Research Landscape Overview
Claimed Contributions
The authors introduce DepthLM, a framework that enables vision-language models to achieve expert-level accuracy in pixel-level metric depth estimation. The method uses visual prompting (markers rendered on the image to indicate the query pixel) together with intrinsic-conditioned augmentation to resolve camera-intrinsic ambiguity, requiring no architectural modifications or specialized loss functions.
The authors create DepthLMBench, a curated mixture of public datasets (approximately 16M training images from 7 datasets) that enables training and evaluation of VLMs for 3D understanding tasks, allowing direct comparison with pure vision models on metric depth estimation.
The authors demonstrate that DepthLM can be extended to train a unified model handling multiple 3D understanding tasks (including principal axis distance, speed estimation, time estimation, two-point distance, and camera pose estimation) using the same architecture and training framework, without requiring task-specific designs.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] SpatialRGPT: Grounded spatial reasoning in vision-language models
Contribution Analysis
Detailed comparisons for each claimed contribution
DepthLM framework for metric depth estimation in VLMs
The authors introduce DepthLM, a framework that enables vision-language models to achieve expert-level accuracy in pixel-level metric depth estimation. The method uses visual prompting (markers rendered on the image to indicate the query pixel) together with intrinsic-conditioned augmentation to resolve camera-intrinsic ambiguity, requiring no architectural modifications or specialized loss functions.
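A minimal sketch of how such visual prompting and intrinsic-conditioned augmentation could look is given below; the marker style, prompt wording, focal-length handling, and augmentation range are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of DepthLM-style visual prompting and intrinsic-conditioned
# augmentation; the details below are assumptions, not the authors' code.
import random
from PIL import Image, ImageDraw


def render_marker(image: Image.Image, pixel: tuple[int, int], radius: int = 6) -> Image.Image:
    """Draw a visible marker at the query pixel so the VLM can reference it in text."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    x, y = pixel
    draw.ellipse([x - radius, y - radius, x + radius, y + radius], outline="red", width=3)
    return out


def intrinsic_conditioned_resize(image, pixel, focal_px, scale_range=(0.7, 1.3)):
    """Randomly resize the image and rescale the query pixel and focal length together,
    so the (image, intrinsics, metric depth) triple stays geometrically consistent and
    the model must rely on the stated focal length to resolve scale."""
    s = random.uniform(*scale_range)
    w, h = image.size
    resized = image.resize((max(1, round(w * s)), max(1, round(h * s))), Image.BILINEAR)
    new_pixel = (round(pixel[0] * s), round(pixel[1] * s))
    return resized, new_pixel, focal_px * s  # the metric depth label is unchanged


def build_prompt(focal_px: float) -> str:
    """Plain-text question; the model is expected to answer with a single number in meters."""
    return (f"The camera focal length is {focal_px:.1f} pixels. "
            "What is the metric depth, in meters, of the point under the red marker? "
            "Answer with a number only.")
```

At inference time, the marked image and the prompt would be passed to the VLM and the numeric answer parsed as the depth prediction; because only the input image and text change, no architectural modification or custom loss is needed.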
[1] SpatialRGPT: Grounded spatial reasoning in vision-language models
[6] RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
[15] Can large language models understand depth in monocular images without prior vision knowledge?
[17] Learning to prompt CLIP for monocular depth estimation: Exploring the limits of human language
[22] TPDepth: Leveraging Text Prompts with ControlNet to Boost Diffusion-based Depth Estimation
[29] Vision-language embodiment for monocular depth estimation
[33] FloodVision: Urban Flood Depth Estimation Using Foundation Vision-Language Models and Domain Knowledge Graph
[69] Learning to adapt CLIP for few-shot monocular depth estimation
[70] Explore until Confident: Efficient Exploration for Embodied Question Answering
[71] Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
DepthLMBench benchmark suite
The authors create DepthLMBench, a curated mixture of public datasets (approximately 16M training images from 7 datasets) that enables training and evaluation of VLMs for 3D understanding tasks, allowing direct comparison with pure vision models on metric depth estimation.
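In practice, such a mixture can be implemented as weighted sampling over per-dataset example pools. The sketch below is a hypothetical illustration; the record fields, dataset names, and weights are assumptions rather than DepthLMBench's actual format.

```python
# Hypothetical sketch of a dataset mixture in the spirit of DepthLMBench; the record
# layout and sampling scheme are assumptions for illustration only.
import random
from dataclasses import dataclass


@dataclass
class DepthSample:
    image_path: str
    pixel: tuple[int, int]  # query pixel (u, v)
    depth_m: float          # ground-truth metric depth in meters
    focal_px: float         # camera focal length in pixels


class MixtureSampler:
    """Draws training examples from several depth datasets in fixed proportions."""

    def __init__(self, datasets: dict[str, list[DepthSample]], weights: dict[str, float]):
        self.names = list(datasets)
        self.datasets = datasets
        total = sum(weights[name] for name in self.names)
        self.probs = [weights[name] / total for name in self.names]

    def sample(self) -> DepthSample:
        name = random.choices(self.names, weights=self.probs, k=1)[0]
        return random.choice(self.datasets[name])
```

Keeping the per-dataset weights explicit makes it straightforward to rebalance indoor, outdoor, and driving data, which matters when the constituent datasets differ in size by orders of magnitude.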
[19] MM-Spatial: Exploring 3D spatial understanding in multimodal LLMs
[51] UniDepth: Universal monocular metric depth estimation
[52] Survey on monocular metric depth estimation
[53] SelfOcc: Self-supervised vision-based 3D occupancy prediction
[54] E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models
[55] 3D Packing for Self-Supervised Monocular Depth Estimation
[56] On the metrics for evaluating monocular depth estimation
[57] From 2D to 3D: Re-thinking benchmarking of monocular depth prediction
[58] Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation
[59] OccDepth: A depth-aware method for 3D semantic scene completion
Unified VLM for diverse 3D understanding tasks
The authors demonstrate that DepthLM can be extended to train a unified model handling multiple 3D understanding tasks (including principal axis distance, speed estimation, time estimation, two-point distance, and camera pose estimation) using the same architecture and training framework, without requiring task-specific designs.
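One plausible way to realize this unification is to cast every task as a text question with a plain-text (usually numeric) answer, so a single VLM, prompt format, and loss cover all tasks. The sketch below is a hypothetical illustration; the task names, prompt templates, and answer formats are assumptions, not the paper's exact templates.

```python
# Hypothetical sketch of a shared question/answer format for several 3D tasks;
# templates and field names are assumptions, not the paper's exact design.
def format_example(task: str, payload: dict) -> tuple[str, str]:
    """Return (prompt, target) strings for one 3D understanding task."""
    if task == "metric_depth":
        prompt = "What is the metric depth, in meters, at the marked pixel?"
        target = f"{payload['depth_m']:.2f}"
    elif task == "two_point_distance":
        prompt = "What is the metric distance, in meters, between the two marked pixels?"
        target = f"{payload['distance_m']:.2f}"
    elif task == "speed":
        prompt = "What is the camera's speed, in meters per second, between the two frames?"
        target = f"{payload['speed_mps']:.2f}"
    else:
        raise ValueError(f"unknown task: {task}")
    return prompt, target
```

Because every task then reduces to next-token prediction over such prompt/target pairs, the same architecture and training loop can absorb additional tasks simply by adding templates and data.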