Hyden: A Hybrid Dual-Path Encoder for Monocular Geometry of High-resolution Images

ICLR 2026 Conference Submission · Anonymous Authors

OpenReview Score: 5.5

3D vision, Monocular Depth Estimation, Monocular Surface Normal Estimation

Abstract:

We present a hybrid dual-path vision encoder (Hyden) for high-resolution monocular depth, point map, and surface normal estimation, surpassing state-of-the-art accuracy at a fraction of the inference cost. The architecture pairs a low-resolution Vision Transformer branch for global context with a full-resolution CNN branch for fine details, fusing features via a lightweight MLP before decoding. By exploiting the linear scaling of CNNs and constraining transformer computation to a fixed resolution, the model delivers fast inference even on multi-megapixel inputs. To overcome the scarcity of high-quality high-resolution supervision, we introduce a self-distillation framework that generates pseudo-labels from existing models on both lower-resolution full images and high-resolution crops: global labels preserve geometric accuracy, while local labels capture sharper details. To demonstrate the flexibility of our approach, we integrate Hyden and our self-distillation method into DepthAnything-v2 for depth estimation and MoGe2 for surface normal and metric point map prediction, achieving state-of-the-art results on high-resolution benchmarks with the lowest inference latency among competing methods.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a hybrid dual-path architecture pairing a low-resolution Vision Transformer branch with a full-resolution CNN branch for depth, normal, and point map estimation. It resides in the 'High-Resolution Processing Strategies' leaf, which contains four papers addressing tiling, patching, or adaptive merging for high-resolution depth. This leaf is part of the broader 'General-Purpose Depth Prediction' branch, indicating a moderately active research direction focused on overcoming memory and computational bottlenecks for megapixel inputs. The taxonomy suggests that this niche is neither sparse nor overcrowded.

The taxonomy tree shows neighboring leaves including 'Multi-Scale and Coarse-to-Fine Architectures' (four papers) and 'Fast and Efficient Architectures' (one paper), both exploring hierarchical feature fusion but without explicit high-resolution focus. The 'Multi-Attribute Geometry Estimation' branch (two papers) addresses joint prediction of depth, normals, and point maps, closely related to the paper's multi-output formulation. The 'Metric and Zero-Shot Depth Estimation' leaf (two papers) tackles scale recovery, a concern also relevant here. The taxonomy's scope and exclude notes clarify that high-resolution strategies are distinguished from general multi-scale methods by their explicit handling of large image dimensions.

Across the twenty-seven candidates examined, the self-distillation framework with global and local pseudo-labels has one refutable candidate among its seven, suggesting some overlap with prior distillation or multi-resolution supervision techniques. For the hybrid dual-path encoder, ten candidates were examined and none refutes it, indicating less direct precedent within the limited search. For the integration into DepthAnything-v2 and MoGe2, ten candidates were likewise examined with no refutations, though this may reflect the specificity of these particular model integrations rather than fundamental novelty. This analysis is a top-K semantic search, not an exhaustive survey.
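
The per-contribution counts above can be tallied as a quick consistency check. The snippet below is only a bookkeeping sketch: the dictionary keys are shorthand labels chosen here, and the numbers are copied from the paragraph above.

```python
# Candidate counts as stated in the assessment above.
# The dictionary keys are shorthand labels, not names used by the report.
candidates = {
    "Hybrid dual-path encoder": {"examined": 10, "refutable": 0},
    "Self-distillation framework": {"examined": 7, "refutable": 1},
    "Integration into DepthAnything-v2 and MoGe2": {"examined": 10, "refutable": 0},
}


def summarize(counts):
    """Return (total candidates examined, total refutable candidates)."""
    total = sum(entry["examined"] for entry in counts.values())
    refutable = sum(entry["refutable"] for entry in counts.values())
    return total, refutable


total_examined, total_refutable = summarize(candidates)
print(f"{total_examined} candidates examined, {total_refutable} refutable")
```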

Given the limited search scope of twenty-seven candidates, the dual-path architecture and integration contributions appear to have fewer documented precedents, while the self-distillation approach encounters at least one overlapping prior work. The taxonomy context suggests the paper addresses an active but not saturated research direction, where efficient high-resolution processing remains an open challenge. A more comprehensive literature review would be needed to assess novelty conclusively, particularly for the distillation framework.

Taxonomy

[Interactive taxonomy figure not reproduced. Legend: Core-task Taxonomy Papers; Claimed Contributions; Contribution Candidate Papers Compared; Refutable Paper.]

Research Landscape Overview

Core task: monocular geometry estimation from high-resolution images. The field encompasses a diverse set of approaches organized into several major branches. Depth Estimation Methods focus on predicting metric or relative depth from single views, with specialized strategies for handling high-resolution inputs and general-purpose scenarios. Multi-Attribute Geometry Estimation extends beyond depth to jointly predict surface normals, occlusion boundaries, and other geometric cues, as seen in works like GeoWizard[8]. 3D Shape Reconstruction from Single Images targets full volumetric or mesh recovery, often leveraging implicit representations or generative priors, exemplified by methods such as Make-It-3D[3] and Wonder3D[10]. Dynamic and Temporal Geometry Reconstruction addresses moving scenes and video sequences, while Geometry-Guided Vision Tasks apply estimated geometry to downstream applications like novel view synthesis and scene understanding. Specialized Reconstruction Domains tackle niche settings such as human digitization (PIFuHD[6], 2K Human Digitization[7]), building elevation estimation, and remote sensing contexts.

Within Depth Estimation Methods, a particularly active line of work explores high-resolution processing strategies to overcome memory and computational bottlenecks. Hyden[0] sits squarely in this cluster, handling large images with a dual-path encoder that pairs a low-resolution Vision Transformer branch for global context with a full-resolution CNN branch for fine details. Nearby approaches like PatchFusion[15] and PatchRefiner[20] similarly pursue local-to-global fusion, but through tile-based schemes that trade off fine-grained detail against global consistency. In contrast, Content-Adaptive Merging[5] and HR-Depth[30] propose alternative tiling or adaptive sampling mechanisms. A central challenge across these methods is balancing resolution fidelity with inference speed and memory footprint, while maintaining metric accuracy at scale. Hyden[0] contributes to this ongoing dialogue by combining scene-level coherence from the transformer branch with local sharpness from the CNN branch, positioning itself among recent efforts to make monocular depth estimation practical for megapixel imagery.

Claimed Contributions

Hyden: Hybrid dual-path vision encoder architecture

10 retrieved papers

The authors propose a novel encoder architecture that pairs a low-resolution Vision Transformer branch for capturing global context with a full-resolution CNN branch for preserving fine details. By constraining the ViT to fixed resolution and exploiting the linear scaling of CNNs, the model achieves efficient inference on multi-megapixel inputs while maintaining accuracy.
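
The scaling argument above can be illustrated with a rough cost model. This is a back-of-the-envelope sketch, not the paper's architecture or its measured latency: the patch size, channel widths, embedding dimension, and the 518-pixel fixed transformer input are illustrative assumptions, and only one attention layer and one convolution layer are counted.

```python
def attention_macs(height: int, width: int, patch: int = 14, dim: int = 1024) -> int:
    """Approximate multiply-accumulates of one global self-attention layer.

    Counts only the QK^T and attention-times-V products, which grow with the
    square of the token count. Projections and the MLP are ignored.
    `patch` and `dim` are illustrative assumptions.
    """
    tokens = (height // patch) * (width // patch)
    return 2 * tokens * tokens * dim


def conv_macs(height: int, width: int, kernel: int = 3,
              c_in: int = 64, c_out: int = 64) -> int:
    """Multiply-accumulates of one stride-1 convolution layer (linear in pixels)."""
    return height * width * kernel * kernel * c_in * c_out


def hybrid_macs(height: int, width: int, vit_side: int = 518) -> int:
    """Fixed-resolution attention plus full-resolution convolution.

    `vit_side` is a hypothetical fixed input size for the transformer branch.
    """
    return attention_macs(vit_side, vit_side) + conv_macs(height, width)


if __name__ == "__main__":
    for side in (1120, 2240, 4480):
        full = attention_macs(side, side)
        hybrid = hybrid_macs(side, side)
        print(f"{side}x{side}: full-res attention {full:.3e} MACs, "
              f"hybrid {hybrid:.3e} MACs")
```

Under this model, doubling the image side multiplies the convolution term by 4 and a full-resolution attention term by 16, while the fixed-resolution attention term stays constant.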

Self-distillation framework with global and local pseudo-labels

Can Refute

7 retrieved papers

The authors introduce a training framework that generates pseudo-labels from existing models at both lower-resolution full images (for geometric accuracy) and high-resolution crops (for sharper details). This approach overcomes the scarcity of high-quality high-resolution supervision without relying on real or synthetic ground truth.
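
The two kinds of pseudo-labels described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `teacher` is a hypothetical callable standing in for an existing model, images are plain nested lists, downsampling is nearest-neighbour striding, and crops are non-overlapping tiles. How the global and local labels are combined during training is not described here and is not modeled.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[float]]


def downsample(image: Grid, factor: int) -> Grid:
    """Keep every `factor`-th pixel in both directions (nearest-neighbour)."""
    return [row[::factor] for row in image[::factor]]


def crop(image: Grid, top: int, left: int, size: int) -> Grid:
    """Extract a square crop of side `size` at full resolution."""
    return [row[left:left + size] for row in image[top:top + size]]


def make_pseudo_labels(
    image: Grid,
    teacher: Callable[[Grid], Grid],  # hypothetical stand-in for an existing model
    factor: int,
    crop_size: int,
) -> Tuple[Grid, Dict[Tuple[int, int], Grid]]:
    """Return a global label from the downsampled full image and local
    labels from non-overlapping high-resolution crops."""
    global_label = teacher(downsample(image, factor))
    local_labels: Dict[Tuple[int, int], Grid] = {}
    height, width = len(image), len(image[0])
    for top in range(0, height - crop_size + 1, crop_size):
        for left in range(0, width - crop_size + 1, crop_size):
            local_labels[(top, left)] = teacher(crop(image, top, left, crop_size))
    return global_label, local_labels
```

For example, calling `make_pseudo_labels(image, teacher, factor=2, crop_size=4)` on an 8x8 input yields one 4x4 global label and four 4x4 local labels keyed by each crop's top-left corner.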

State-of-the-art integration into DepthAnything-v2 and MoGe2

10 retrieved papers

The authors demonstrate the flexibility and effectiveness of their approach by integrating Hyden and the self-distillation method into two leading geometry estimation models. This integration achieves state-of-the-art results on high-resolution benchmarks while significantly reducing inference latency compared to original models.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[5] Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging

S. Mahdi H. Miangoleh, Sebastian Dille, Long Mai, Sylvain Paris, Yağız Aksoy (2021)

[15] PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation

Zhenyu Li, Shariq Farooq Bhat, Peter Wonka (2024)

[20] PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation

Zhenyu Li, Shariq Farooq Bhat, Peter Wonka (2024) • European Conference on Computer Vision

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Hyden: Hybrid dual-path vision encoder architecture

[58] A survey of the vision transformers and their CNN-transformer based variants

Cannot Refute

[59] Global context vision transformers

Cannot Refute

[60] Systematic review of hybrid vision transformer architectures for radiological image analysis

Cannot Refute

[61] CMT: Convolutional Neural Networks Meet Vision Transformers

Cannot Refute

[62] Do Vision Transformers See Like Convolutional Neural Networks?

Cannot Refute

[63] Vit-comer: Vision transformer with convolutional multi-scale feature interaction for dense predictions

Cannot Refute

[64] Inception convolutional vision transformers for plant disease identification

Cannot Refute

[65] Cervical cancer detection: A comprehensive evaluation of CNN models, vision transformer approaches, and fusion strategies

Cannot Refute

[66] CvT: Introducing Convolutions to Vision Transformers

Cannot Refute

[67] Landslide Susceptibility Mapping Considering Landslide Local-Global Features Based on CNN and Transformer

Cannot Refute

Contribution

Self-distillation framework with global and local pseudo-labels

[54] Self-supervised Learning of Depth Inference for Multi-view Stereo

Can Refute

[51] Knowledge distillation of multi-scale dense prediction transformer for self-supervised depth estimation

Cannot Refute

[52] RGB-Based Visual–Inertial Odometry via Knowledge Distillation from Self-Supervised Depth Estimation with Foundation Models

Cannot Refute

[53] Open vocabulary 3d scene understanding via geometry guided self-distillation

Cannot Refute

[55] GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes

Cannot Refute

[56] Self-Distilled Self-Supervised Monocular Depth Estimation

Cannot Refute

[57] Progressive Target Refinement by Self-distillation for Human Pose Estimation

Cannot Refute

Contribution

State-of-the-art integration into DepthAnything-v2 and MoGe2

[5] Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging

Cannot Refute

[35] Dens3R: A Foundation Model for 3D Geometry Prediction

Cannot Refute

[68] Geonet: Geometric neural network for joint depth and surface normal estimation

Cannot Refute

[69] Sapiens: Foundation for Human Vision Models

Cannot Refute

[70] iDisc: Internal Discretization for Monocular Depth Estimation

Cannot Refute

[71] WorldMirror: Universal 3D World Reconstruction with Any-Prior Prompting

Cannot Refute

[72] SN360: semantic and surface normal cascaded multi-task 360 monocular depth estimation

Cannot Refute

[73] Adaptive surface normal constraint for geometric estimation from monocular images

Cannot Refute

[74] Enforcing geometric constraints of virtual normal for depth prediction

Cannot Refute

[75] M3Depth: Wavelet-Enhanced Depth Estimation on Mars via Mutual Boosting of Dual-Modal Data

Cannot Refute
