Hyden: A Hybrid Dual-Path Encoder for Monocular Geometry of High-resolution Images

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: 3D vision, Monocular Depth Estimation, Monocular Surface Normal Estimation
Abstract:

We present a hybrid dual-path vision encoder (Hyden) for high-resolution monocular depth, point map and surface normal estimation, surpassing state-of-the-art accuracy at a fraction of the inference cost. The architecture pairs a low-resolution Vision Transformer branch for global context with a full-resolution CNN branch for fine details, fusing features via a lightweight MLP before decoding. By exploiting the linear scaling of CNNs and constraining transformer computation to a fixed resolution, the model delivers fast inference even on multi-megapixel inputs. To overcome the scarcity of high-quality high-resolution supervision, we introduce a self-distillation framework that generates pseudo-labels from existing models on both lower-resolution full images and high-resolution crops: global labels preserve geometric accuracy, while local labels capture sharper details. To demonstrate the flexibility of our approach, we integrate Hyden and our self-distillation method into DepthAnything-v2 for depth estimation and MoGe2 for surface normal and metric point map prediction, achieving state-of-the-art results on high-resolution benchmarks with the lowest inference latency among competing methods.
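The fusion step described in the abstract can be pictured with a toy sketch. The abstract states only that a low-resolution ViT branch and a full-resolution CNN branch are fused by a lightweight MLP; everything else below is an assumption made for illustration: the function names, the feature sizes, nearest-neighbour upsampling, channel concatenation, and a single hidden layer with ReLU. Random features stand in for the outputs of the two branches.

```python
import random

def upsample_nearest(feat, out_h, out_w):
    """Nearest-neighbour upsampling of an [h][w][c] grid to [out_h][out_w][c]."""
    in_h, in_w = len(feat), len(feat[0])
    return [[feat[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]

def linear(vec, weights, bias):
    """Dense layer; weights is [out][in], bias is [out]."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, v) for v in vec]

def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Randomly initialised two-layer MLP (hypothetical sizes)."""
    def init(n_out, n_in):
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return {"w1": init(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "w2": init(out_dim, hidden_dim), "b2": [0.0] * out_dim}

def fuse(global_feat, local_feat, mlp):
    """Upsample the coarse features, concatenate per pixel, apply the MLP."""
    h, w = len(local_feat), len(local_feat[0])
    up = upsample_nearest(global_feat, h, w)
    fused = []
    for y in range(h):
        row = []
        for x in range(w):
            joint = up[y][x] + local_feat[y][x]  # list concatenation of channels
            hidden = relu(linear(joint, mlp["w1"], mlp["b1"]))
            row.append(linear(hidden, mlp["w2"], mlp["b2"]))
        fused.append(row)
    return fused

rng = random.Random(0)
# Stand-ins for branch outputs: a 2x2 grid with 4 channels (coarse, "ViT")
# and an 8x8 grid with 3 channels (fine, "CNN").
global_feat = [[[rng.random() for _ in range(4)] for _ in range(2)] for _ in range(2)]
local_feat = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
mlp = make_mlp(in_dim=7, hidden_dim=8, out_dim=5, rng=rng)
fused = fuse(global_feat, local_feat, mlp)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 8 8 5
```

The fused grid keeps the spatial size of the fine branch, which is the property a decoder for dense prediction would need; the actual fusion design in the paper may differ.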

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a hybrid dual-path architecture pairing a low-resolution Vision Transformer branch with a full-resolution CNN branch for depth, normal, and point map estimation. It resides in the 'High-Resolution Processing Strategies' leaf, which contains four papers addressing tiling, patching, or adaptive merging for high-resolution depth. This leaf is part of the broader 'General-Purpose Depth Prediction' branch, indicating a moderately active research direction focused on overcoming memory and computational bottlenecks for megapixel inputs. The taxonomy suggests this niche is neither sparse nor overcrowded.

The taxonomy tree shows neighboring leaves including 'Multi-Scale and Coarse-to-Fine Architectures' (four papers) and 'Fast and Efficient Architectures' (one paper), both exploring hierarchical feature fusion but without an explicit high-resolution focus. The 'Multi-Attribute Geometry Estimation' branch (two papers) addresses joint prediction of depth, normals, and point maps, closely related to the paper's multi-output formulation. The 'Metric and Zero-Shot Depth Estimation' leaf (two papers) tackles scale recovery, a concern also relevant here. The taxonomy's scope and exclusion notes clarify that high-resolution strategies are distinguished from general multi-scale methods by their explicit handling of large image dimensions.

Among the twenty-seven candidates examined, one of the seven compared against the self-distillation framework with global and local pseudo-labels was judged able to refute it, suggesting some overlap with prior distillation or multi-resolution supervision techniques. For the hybrid dual-path encoder, ten candidates were examined and none refuted it, indicating less direct precedent within the limited search. For the integration into DepthAnything-v2 and MoGe2, ten candidates were likewise examined with no refutations, though this may reflect the specificity of these particular model integrations rather than fundamental novelty. The analysis is based on a top-K semantic search, not an exhaustive survey.

Given the limited search scope of twenty-seven candidates, the dual-path architecture and integration contributions appear to have fewer documented precedents, while the self-distillation approach encounters at least one overlapping prior work. The taxonomy context suggests the paper addresses an active but not saturated research direction, where efficient high-resolution processing remains an open challenge. A more comprehensive literature review would be needed to assess novelty conclusively, particularly for the distillation framework.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Paper: 1

Research Landscape Overview

Core task: monocular geometry estimation from high-resolution images. The field encompasses a diverse set of approaches organized into several major branches. Depth Estimation Methods focus on predicting metric or relative depth from single views, with specialized strategies for handling high-resolution inputs and general-purpose scenarios. Multi-Attribute Geometry Estimation extends beyond depth to jointly predict surface normals, occlusion boundaries, and other geometric cues, as seen in works like GeoWizard[8]. 3D Shape Reconstruction from Single Images targets full volumetric or mesh recovery, often leveraging implicit representations or generative priors, exemplified by methods such as Make-It-3D[3] and Wonder3D[10]. Dynamic and Temporal Geometry Reconstruction addresses moving scenes and video sequences, while Geometry-Guided Vision Tasks apply estimated geometry to downstream applications like novel view synthesis and scene understanding. Specialized Reconstruction Domains tackle niche settings such as human digitization (Pifuhd[6], 2K Human Digitization[7]), building elevation estimation, and remote sensing contexts.

Within Depth Estimation Methods, a particularly active line of work explores high-resolution processing strategies to overcome memory and computational bottlenecks. Hyden[0] sits squarely in this cluster, emphasizing efficient handling of large images by pairing a fixed low-resolution transformer branch with a full-resolution CNN branch. Nearby approaches like PatchFusion[15] and PatchRefiner[20] similarly adopt local-to-global fusion schemes, trading off between fine-grained detail and global consistency. In contrast, Content-Adaptive Merging[5] and HR-Depth[30] propose alternative tiling or adaptive sampling mechanisms. A central challenge across these methods is balancing resolution fidelity with inference speed and memory footprint, while maintaining metric accuracy at scale. Hyden[0] contributes to this ongoing dialogue by refining how hierarchical representations can preserve both local sharpness and scene-level coherence, positioning itself among recent efforts to make monocular depth estimation practical for megapixel imagery.

Claimed Contributions

Hyden: Hybrid dual-path vision encoder architecture

The authors propose a novel encoder architecture that pairs a low-resolution Vision Transformer branch for capturing global context with a full-resolution CNN branch for preserving fine details. By constraining the ViT to fixed resolution and exploiting the linear scaling of CNNs, the model achieves efficient inference on multi-megapixel inputs while maintaining accuracy.
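The efficiency argument can be made concrete with a rough count. The sketch below compares how the pairwise self-attention term and a convolution's output positions grow with image size. The patch size of 14 and the fixed resolution of 518 x 518 are assumptions chosen for illustration, not values taken from the paper, and the count ignores the linear-cost parts of a transformer (projections and MLP blocks), so it is a scaling illustration rather than a latency model.

```python
def vit_tokens(height, width, patch=14):
    """Number of patch tokens for a ViT with square patches (patch size assumed)."""
    return (height // patch) * (width // patch)

def attention_pairs(tokens):
    """Token pairs scored by global self-attention: quadratic in token count."""
    return tokens * tokens

def conv_positions(height, width):
    """Output positions visited by a stride-1 convolution: linear in pixel count."""
    return height * width

def relative_costs(height, width, base=(518, 518), patch=14):
    """Costs relative to the assumed fixed transformer resolution `base`."""
    base_attn = attention_pairs(vit_tokens(base[0], base[1], patch))
    return {
        # Transformer run directly at the input resolution.
        "vit_at_input_res": attention_pairs(vit_tokens(height, width, patch)) / base_attn,
        # Transformer held at the fixed resolution: constant by construction.
        "vit_at_fixed_res": 1.0,
        # Convolution run at the input resolution.
        "cnn_at_input_res": conv_positions(height, width) / conv_positions(base[0], base[1]),
    }

for side in (518, 1036, 2072):
    print(side, relative_costs(side, side))
```

Under these assumptions, doubling the image side multiplies the attention term by 16 but the convolutional term by only 4, while a transformer held at a fixed resolution does not grow at all.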

10 retrieved papers
Self-distillation framework with global and local pseudo-labels

The authors introduce a training framework that generates pseudo-labels from existing models at both lower-resolution full images (for geometric accuracy) and high-resolution crops (for sharper details). This approach overcomes the scarcity of high-quality high-resolution supervision without relying on real or synthetic ground truth.
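The two label sources can be sketched as a small data-generation routine. This is a minimal illustration under stated assumptions, not the authors' pipeline: `toy_teacher` is a hypothetical stand-in for a pretrained model, strided subsampling stands in for image resizing, and the non-overlapping crop grid is chosen only for simplicity. How the global and local labels are combined into a training loss is not described in this report, so the sketch stops at label generation.

```python
def downsample(image, factor):
    """Keep every `factor`-th pixel in both directions (strided subsampling)."""
    return [row[::factor] for row in image[::factor]]

def crop(image, top, left, size):
    """Square crop of side `size` with its top-left corner at (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]

def toy_teacher(image):
    """Hypothetical stand-in for an existing model: pixel value -> fake depth."""
    return [[1.0 + 0.01 * px for px in row] for row in image]

def make_pseudo_labels(image, teacher, factor, crop_size):
    """Return one global label (teacher on the downsampled full image) and a
    list of ((top, left), label) local labels (teacher on full-resolution crops)."""
    global_label = teacher(downsample(image, factor))
    local_labels = []
    height, width = len(image), len(image[0])
    for top in range(0, height - crop_size + 1, crop_size):
        for left in range(0, width - crop_size + 1, crop_size):
            patch = crop(image, top, left, crop_size)
            local_labels.append(((top, left), teacher(patch)))
    return global_label, local_labels

# A 16x16 synthetic "image" with integer intensities.
image = [[(x + y) % 256 for x in range(16)] for y in range(16)]
global_label, local_labels = make_pseudo_labels(image, toy_teacher, factor=4, crop_size=8)
print(len(global_label), len(global_label[0]))      # 4 4
print(len(local_labels), len(local_labels[0][1]))   # 4 8
```

The global label sees the whole scene at reduced resolution, while each local label sees only one crop at full resolution, mirroring the accuracy-versus-sharpness split described above.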

7 retrieved papers (1 can refute)
State-of-the-art integration into DepthAnything-v2 and MoGe2

The authors demonstrate the flexibility and effectiveness of their approach by integrating Hyden and the self-distillation method into two leading geometry estimation models. This integration achieves state-of-the-art results on high-resolution benchmarks while significantly reducing inference latency compared to original models.

10 retrieved papers
