Monocular Normal Estimation via Shading Sequence Estimation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Video Diffusion Model, Shading Estimation, Single-view Normal Estimation
Abstract:

Monocular normal estimation aims to recover a normal map from a single RGB image of an object under arbitrary lighting. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have an overall correct color distribution, the reconstructed surfaces frequently fail to align with the true geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct spatially varying geometry, because it is represented in normal maps only by relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, since shading sequences are more sensitive to variations in geometry. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and lighting conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation. Code and dataset will be released to facilitate reproducible research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes reformulating monocular normal estimation as shading sequence estimation, arguing that shading sequences better capture geometric variations than direct normal prediction. According to the taxonomy, this work resides in the 'Shading-Based Normal Estimation' leaf under 'Novel Paradigms and Representations', where it is currently the only paper. This isolation suggests the shading-sequence paradigm represents a relatively unexplored direction within a field that has primarily focused on direct regression or joint depth-normal estimation. The sparse population of this leaf contrasts with more crowded branches like 'Encoder-Decoder and Multi-Scale Architectures' or 'Diffusion-Based Geometry Generation', indicating the approach occupies a niche position.

The taxonomy reveals that neighboring research directions emphasize different modeling philosophies. The sibling leaf 'Canonical Frame and Local Coordinate Estimation' explores local coordinate systems rather than shading cues, while the broader 'Generative and Foundation Models' branch leverages large-scale pre-training without explicit physical modeling. The 'Data-Driven Deep Learning Approaches' branch, containing multiple encoder-decoder architectures, treats RGB pixels as generic features rather than decomposing them into shading components. The paper's focus on shading sequences bridges classical photometric techniques with modern generative models, positioning it between physics-based reasoning and data-driven learning—a boundary less explored than either extreme.

Among the three contributions analyzed across 18 candidate papers, the shading-sequence paradigm examined 2 candidates with no clear refutations, suggesting limited prior work on this specific formulation. The RoSE method using image-to-video models examined 6 candidates without refutation, indicating the application of video generation to normal estimation may be relatively novel. However, the MultiShade synthetic dataset contribution examined 10 candidates and found 2 refutable instances, suggesting synthetic datasets with diverse materials and lighting are more established in the field. The limited search scope (18 candidates total) means these findings reflect top-K semantic matches rather than exhaustive coverage.

Based on the examined candidates, the shading-sequence reformulation and video-generation approach appear less explored than synthetic dataset creation. The taxonomy structure confirms the paper occupies a sparse research direction, though the small search scope (18 papers) and single-paper leaf status warrant caution. The analysis captures semantic neighbors but cannot rule out relevant work outside the top-K matches or in adjacent computer vision subfields like photometric stereo or intrinsic image decomposition.

Taxonomy

Core-task Taxonomy Papers: 44
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 2

Research Landscape Overview

Core task: monocular normal estimation from single RGB images. The field has evolved into several major branches that reflect different modeling philosophies and application contexts. Joint Depth and Normal Estimation approaches treat geometry prediction as a coupled problem, leveraging shared representations to improve both outputs. Generative and Foundation Models for Geometry Estimation harness large-scale pre-training and diffusion-based architectures to produce robust predictions across diverse scenes, as seen in works like GeoWizard[4] and Metric3D v2[3]. Structured Scene Understanding emphasizes semantic and geometric priors, such as planar constraints or Manhattan-world assumptions, to guide normal prediction in indoor or urban environments. Data-Driven Deep Learning Approaches focus on end-to-end architectures and loss formulations that directly optimize normal accuracy, while Specialized Application Domains target specific use cases like autonomous driving or cloth capture. Multi-Modal and Sensor Fusion Methods integrate additional cues such as LiDAR or polarization data to refine estimates. Novel Paradigms and Representations explore alternative input modalities or intermediate representations, and 3D Reconstruction and Scene Understanding Integration connects normal estimation to broader reconstruction pipelines.

Within Novel Paradigms and Representations, a small cluster of works investigates unconventional input signals or intermediate features. Shading Sequence Normal[0] sits in this branch, emphasizing shading cues as a primary source of geometric information, an approach that contrasts with purely data-driven methods that treat RGB pixels as generic features. This focus on shading-based reasoning aligns with classical photometric techniques but leverages modern learning frameworks. Nearby, Adaptive Surface Normal[1] explores adaptive mechanisms for handling varying surface properties, while Polarimetric Leaf Normal[2] demonstrates how polarization can complement RGB for specific materials. The original paper's emphasis on shading sequences distinguishes it from foundation models like GeoWizard[4] or Wonder3D[5], which rely on large-scale pre-training rather than explicit physical modeling. This positioning highlights an ongoing tension in the field: whether to exploit domain-specific cues or to scale generic architectures.

Claimed Contributions

New paradigm reformulating normal estimation as shading sequence estimation

The authors propose a paradigm shift where monocular normal estimation is reformulated as predicting a shading sequence under canonical lights, which is more sensitive to geometric variations than directly predicting normal maps. This addresses the 3D misalignment problem in existing methods.

2 retrieved papers
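The claimed sensitivity of shading sequences can be illustrated with a minimal Lambertian forward model: under k canonical directional lights, each pixel's shading is the clamped dot product of its normal with the light direction. The function names and light choices below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def render_shading_sequence(normals, lights):
    """normals: (H, W, 3) unit normals; lights: (k, 3) unit light directions.
    Returns a (k, H, W) sequence of Lambertian shading images."""
    shading = np.einsum("hwc,kc->khw", normals, lights)
    return np.clip(shading, 0.0, None)  # clamp: back-facing lights give zero shading

# Example: a flat patch tilted halfway toward +x, lit by three axis-aligned lights.
n = np.array([1.0, 0.0, 1.0]) / np.sqrt(2.0)
normals = np.tile(n, (4, 4, 1))
lights = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
seq = render_shading_sequence(normals, lights)  # shape (3, 4, 4)
```

A small tilt of the normal changes each frame of the sequence directly, which is the intuition behind the claim that shading sequences expose geometric variation more strongly than the color encoding of a normal map.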
RoSE method using image-to-video generative model for shading sequence prediction

RoSE leverages image-to-video generative models to predict shading sequences from a single grayscale image, then converts these sequences into normal maps using an ordinary least-squares solver. This approach achieves state-of-the-art performance on benchmark datasets.

6 retrieved papers
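The description above says RoSE converts predicted shading sequences into normal maps via ordinary least squares. A minimal sketch of that inversion under a Lambertian model s = L n, with illustrative light directions chosen so no shading is clamped (all dot products positive); names and values are assumptions, not the paper's implementation:

```python
import numpy as np

def normals_from_shading(shading, lights):
    """shading: (k, H, W) shading images; lights: (k, 3) unit light directions.
    Solves the per-pixel OLS problem L n = s, then renormalizes to unit length."""
    k, h, w = shading.shape
    s = shading.reshape(k, -1)                       # stack pixels as columns
    n, *_ = np.linalg.lstsq(lights, s, rcond=None)   # (3, H*W) least-squares normals
    n = n / np.maximum(np.linalg.norm(n, axis=0, keepdims=True), 1e-8)
    return n.T.reshape(h, w, 3)

# Round-trip check on synthetic data.
lights = np.array([[0.5, 0.0, 1.0], [-0.5, 0.0, 1.0],
                   [0.0, 0.5, 1.0], [0.0, -0.5, 1.0]])
lights = lights / np.linalg.norm(lights, axis=1, keepdims=True)
true_n = np.array([0.3, -0.2, 0.93])
true_n = true_n / np.linalg.norm(true_n)
normals = np.tile(true_n, (2, 2, 1))
shading = np.einsum("hwc,kc->khw", normals, lights)  # Lambertian render, no clamp active
recovered = normals_from_shading(shading, lights)
```

With at least three non-coplanar lights and no clamping, the system is overdetermined but consistent, so the least-squares solution recovers the normals exactly up to floating-point error.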
MultiShade synthetic dataset with diverse materials and lighting

The authors curate MultiShade, a large-scale synthetic dataset built on Objaverse models with material augmentation from MatSynth and diverse lighting conditions (parallel, point, and environment lights). This dataset improves model robustness and generalization to complex real-world scenarios.

10 retrieved papers
Can Refute
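The dataset description mentions three lighting families (parallel, point, and environment lights). A sketch of how such a mix might be sampled during data generation; the field names and parameter ranges are purely illustrative assumptions, not MultiShade's actual generation code.

```python
import random

def sample_light(rng: random.Random) -> dict:
    """Draw one lighting configuration from the three families named in the report."""
    kind = rng.choice(["parallel", "point", "environment"])
    if kind == "parallel":
        # Directional light: only a direction matters.
        return {"type": kind,
                "direction": [rng.uniform(-1.0, 1.0) for _ in range(3)]}
    if kind == "point":
        # Point light: position in the scene plus an intensity.
        return {"type": kind,
                "position": [rng.uniform(-2.0, 2.0) for _ in range(3)],
                "intensity": rng.uniform(0.5, 2.0)}
    # Environment light: index into a pool of HDRI environment maps.
    return {"type": kind, "env_map_id": rng.randrange(100)}

configs = [sample_light(random.Random(seed)) for seed in range(5)]
```

Seeding per sample, as above, keeps the generated configurations reproducible across runs.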

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

New paradigm reformulating normal estimation as shading sequence estimation

The authors propose a paradigm shift where monocular normal estimation is reformulated as predicting a shading sequence under canonical lights, which is more sensitive to geometric variations than directly predicting normal maps. This addresses the 3D misalignment problem in existing methods.

Contribution 2

RoSE method using image-to-video generative model for shading sequence prediction

RoSE leverages image-to-video generative models to predict shading sequences from a single grayscale image, then converts these sequences into normal maps using an ordinary least-squares solver. This approach achieves state-of-the-art performance on benchmark datasets.

Contribution 3

MultiShade synthetic dataset with diverse materials and lighting

The authors curate MultiShade, a large-scale synthetic dataset built on Objaverse models with material augmentation from MatSynth and diverse lighting conditions (parallel, point, and environment lights). This dataset improves model robustness and generalization to complex real-world scenarios.
