Monocular Normal Estimation via Shading Sequence Estimation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Video Diffusion Model, Shading Estimation, Single-view Normal Estimation
Abstract:

Monocular normal estimation aims to recover a normal map from a single RGB image of an object under arbitrary lighting. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have an overall correct color distribution, the reconstructed surfaces frequently fail to align with the true geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct spatially varying geometry, because it is represented in normal maps only by relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, since shading sequences are more sensitive to variations in geometry. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and lighting conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation. Code and dataset will be released to facilitate reproducible research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes reformulating monocular normal estimation as shading sequence estimation, arguing that shading sequences better capture geometric variations than direct normal prediction. According to the taxonomy, this work resides in the 'Shading-Based Normal Estimation' leaf under 'Novel Paradigms and Representations', where it is currently the only paper. This isolation suggests the shading-sequence paradigm represents a relatively unexplored direction within a field that has primarily focused on direct regression or joint depth-normal estimation. The sparse population of this leaf contrasts with more crowded branches like 'Encoder-Decoder and Multi-Scale Architectures' or 'Diffusion-Based Geometry Generation', indicating the approach occupies a niche position.

The taxonomy reveals that neighboring research directions emphasize different modeling philosophies. The sibling leaf 'Canonical Frame and Local Coordinate Estimation' explores local coordinate systems rather than shading cues, while the broader 'Generative and Foundation Models' branch leverages large-scale pre-training without explicit physical modeling. The 'Data-Driven Deep Learning Approaches' branch, containing multiple encoder-decoder architectures, treats RGB pixels as generic features rather than decomposing them into shading components. The paper's focus on shading sequences bridges classical photometric techniques with modern generative models, positioning it between physics-based reasoning and data-driven learning—a boundary less explored than either extreme.

Among the three contributions analyzed across 18 candidate papers, the shading-sequence paradigm examined 2 candidates with no clear refutations, suggesting limited prior work on this specific formulation. The RoSE method using image-to-video models examined 6 candidates without refutation, indicating the application of video generation to normal estimation may be relatively novel. However, the MultiShade synthetic dataset contribution examined 10 candidates and found 2 refutable instances, suggesting synthetic datasets with diverse materials and lighting are more established in the field. The limited search scope (18 candidates total) means these findings reflect top-K semantic matches rather than exhaustive coverage.

Based on the examined candidates, the shading-sequence reformulation and video-generation approach appear less explored than synthetic dataset creation. The taxonomy structure confirms the paper occupies a sparse research direction, though the small search scope (18 papers) and single-paper leaf status warrant caution. The analysis captures semantic neighbors but cannot rule out relevant work outside the top-K matches or in adjacent computer vision subfields like photometric stereo or intrinsic image decomposition.

Taxonomy

Core-task Taxonomy Papers: 44
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 2

Research Landscape Overview

Core task: monocular normal estimation from single RGB images. The field has evolved into several major branches that reflect different modeling philosophies and application contexts. Joint Depth and Normal Estimation approaches treat geometry prediction as a coupled problem, leveraging shared representations to improve both outputs. Generative and Foundation Models for Geometry Estimation harness large-scale pre-training and diffusion-based architectures to produce robust predictions across diverse scenes, as seen in works like GeoWizard[4] and Metric3D v2[3]. Structured Scene Understanding emphasizes semantic and geometric priors, such as planar constraints or Manhattan-world assumptions, to guide normal prediction in indoor or urban environments. Data-Driven Deep Learning Approaches focus on end-to-end architectures and loss formulations that directly optimize normal accuracy, while Specialized Application Domains target specific use cases like autonomous driving or cloth capture. Multi-Modal and Sensor Fusion Methods integrate additional cues such as LiDAR or polarization data to refine estimates. Novel Paradigms and Representations explore alternative input modalities or intermediate representations, and 3D Reconstruction and Scene Understanding Integration connects normal estimation to broader reconstruction pipelines.

Within Novel Paradigms and Representations, a small cluster of works investigates unconventional input signals or intermediate features. Shading Sequence Normal[0] sits in this branch, emphasizing shading cues as a primary source of geometric information, an approach that contrasts with purely data-driven methods that treat RGB pixels as generic features. This focus on shading-based reasoning aligns with classical photometric techniques but leverages modern learning frameworks. Nearby, Adaptive Surface Normal[1] explores adaptive mechanisms for handling varying surface properties, while Polarimetric Leaf Normal[2] demonstrates how polarization can complement RGB for specific materials. The original paper's emphasis on shading sequences distinguishes it from foundation models like GeoWizard[4] or Wonder3D[5], which rely on large-scale pre-training rather than explicit physical modeling. This positioning highlights an ongoing tension in the field: whether to exploit domain-specific cues or to scale generic architectures.

Claimed Contributions

New paradigm reformulating normal estimation as shading sequence estimation

The authors propose a paradigm shift where monocular normal estimation is reformulated as predicting a shading sequence under canonical lights, which is more sensitive to geometric variations than directly predicting normal maps. This addresses the 3D misalignment problem in existing methods.

2 retrieved papers
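The claimed sensitivity of shading sequences can be illustrated with a minimal Lambertian forward model: under k canonical directional lights, each pixel's shading is the clamped dot product of its normal with the light direction. The function names and light choices below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def render_shading_sequence(normals, lights):
    """normals: (H, W, 3) unit normals; lights: (k, 3) unit light directions.
    Returns a (k, H, W) sequence of Lambertian shading images."""
    shading = np.einsum("hwc,kc->khw", normals, lights)
    return np.clip(shading, 0.0, None)  # clamp: back-facing lights give zero shading

# Example: a flat patch tilted halfway toward +x, lit by three axis-aligned lights.
n = np.array([1.0, 0.0, 1.0]) / np.sqrt(2.0)
normals = np.tile(n, (4, 4, 1))
lights = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
seq = render_shading_sequence(normals, lights)  # shape (3, 4, 4)
```

A small tilt of the normal changes each frame of the sequence directly, which is the intuition behind the claim that shading sequences expose geometric variation more strongly than the color encoding of a normal map.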
RoSE method using image-to-video generative model for shading sequence prediction

RoSE leverages image-to-video generative models to predict shading sequences from a single grayscale image, then converts these sequences into normal maps using an ordinary least-squares solver. This approach achieves state-of-the-art performance on benchmark datasets.

6 retrieved papers
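The description above says RoSE converts predicted shading sequences into normal maps via ordinary least squares. A minimal sketch of that inversion under a Lambertian model s = L n, with illustrative light directions chosen so no shading is clamped (all dot products positive); names and values are assumptions, not the paper's implementation:

```python
import numpy as np

def normals_from_shading(shading, lights):
    """shading: (k, H, W) shading images; lights: (k, 3) unit light directions.
    Solves the per-pixel OLS problem L n = s, then renormalizes to unit length."""
    k, h, w = shading.shape
    s = shading.reshape(k, -1)                       # stack pixels as columns
    n, *_ = np.linalg.lstsq(lights, s, rcond=None)   # (3, H*W) least-squares normals
    n = n / np.maximum(np.linalg.norm(n, axis=0, keepdims=True), 1e-8)
    return n.T.reshape(h, w, 3)

# Round-trip check on synthetic data.
lights = np.array([[0.5, 0.0, 1.0], [-0.5, 0.0, 1.0],
                   [0.0, 0.5, 1.0], [0.0, -0.5, 1.0]])
lights = lights / np.linalg.norm(lights, axis=1, keepdims=True)
true_n = np.array([0.3, -0.2, 0.93])
true_n = true_n / np.linalg.norm(true_n)
normals = np.tile(true_n, (2, 2, 1))
shading = np.einsum("hwc,kc->khw", normals, lights)  # Lambertian render, no clamp active
recovered = normals_from_shading(shading, lights)
```

With at least three non-coplanar lights and no clamping, the system is overdetermined but consistent, so the least-squares solution recovers the normals exactly up to floating-point error.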
MultiShade synthetic dataset with diverse materials and lighting

The authors curate MultiShade, a large-scale synthetic dataset built on Objaverse models with material augmentation from MatSynth and diverse lighting conditions (parallel, point, and environment lights). This dataset improves model robustness and generalization to complex real-world scenarios.

10 retrieved papers
Can Refute
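The dataset description mentions three lighting families (parallel, point, and environment lights). A sketch of how such a mix might be sampled during data generation; the field names and parameter ranges are purely illustrative assumptions, not MultiShade's actual generation code.

```python
import random

def sample_light(rng: random.Random) -> dict:
    """Draw one lighting configuration from the three families named in the report."""
    kind = rng.choice(["parallel", "point", "environment"])
    if kind == "parallel":
        # Directional light: only a direction matters.
        return {"type": kind,
                "direction": [rng.uniform(-1.0, 1.0) for _ in range(3)]}
    if kind == "point":
        # Point light: position in the scene plus an intensity.
        return {"type": kind,
                "position": [rng.uniform(-2.0, 2.0) for _ in range(3)],
                "intensity": rng.uniform(0.5, 2.0)}
    # Environment light: index into a pool of HDRI environment maps.
    return {"type": kind, "env_map_id": rng.randrange(100)}

configs = [sample_light(random.Random(seed)) for seed in range(5)]
```

Seeding per sample, as above, keeps the generated configurations reproducible across runs.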

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

New paradigm reformulating normal estimation as shading sequence estimation

The authors propose a paradigm shift where monocular normal estimation is reformulated as predicting a shading sequence under canonical lights, which is more sensitive to geometric variations than directly predicting normal maps. This addresses the 3D misalignment problem in existing methods.

Contribution 2

RoSE method using image-to-video generative model for shading sequence prediction

RoSE leverages image-to-video generative models to predict shading sequences from a single grayscale image, then converts these sequences into normal maps using an ordinary least-squares solver. This approach achieves state-of-the-art performance on benchmark datasets.

Contribution 3

MultiShade synthetic dataset with diverse materials and lighting

The authors curate MultiShade, a large-scale synthetic dataset built on Objaverse models with material augmentation from MatSynth and diverse lighting conditions (parallel, point, and environment lights). This dataset improves model robustness and generalization to complex real-world scenarios.
