CHROMA: Consistent Harmonization of Multi-View Appearance via Bilateral Grid Prediction

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Bilateral Grid, Appearance Harmonization, 3D Reconstruction
Abstract:

Modern camera pipelines apply extensive on-device processing, such as exposure adjustment, white balance, and color correction, which, while beneficial for individual frames, often introduces photometric inconsistencies across views. These appearance variations violate multi-view consistency and degrade novel view synthesis. Joint optimization of scene-specific representations and per-image appearance embeddings has been proposed to address this issue, but at the cost of increased computational complexity and slower training. In this work, we propose a generalizable, feed-forward approach that predicts spatially adaptive bilateral grids to correct photometric variations in a multi-view consistent manner. Our model processes hundreds of frames in a single step, enabling efficient large-scale harmonization, and integrates seamlessly into downstream 3D reconstruction models, providing cross-scene generalization without scene-specific retraining. To overcome the lack of paired data, we employ a hybrid self-supervised rendering loss that leverages 3D foundation models, improving generalization to real-world variations. Extensive experiments show that our approach matches or outperforms the reconstruction quality of existing scene-specific optimization methods with appearance modeling, without significantly affecting the training time of baseline 3D models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a feed-forward approach for multi-view appearance harmonization using spatially adaptive bilateral grids to correct photometric inconsistencies introduced by camera processing pipelines. It resides in the 'Appearance Harmonization and Photometric Consistency' leaf, which contains only four papers total, including this work. This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that photometric harmonization for 3D reconstruction remains an underexplored area compared to more crowded branches like diffusion-based multi-view generation or geometric consistency enforcement.

The taxonomy reveals that neighboring research directions focus on geometric consistency via cross-view constraints (six papers) and neural rendering with multi-view consistency (six papers across NeRF and Gaussian splatting). The sibling papers in the same leaf address related but distinct aspects: generative multiview relighting, specular-to-diffuse translation, and other photometric challenges. The scope note explicitly excludes purely geometric methods, positioning this work at the intersection of appearance modeling and reconstruction rather than generative synthesis or traditional multi-view stereo, which occupy separate branches with substantially more papers.

Among the eleven candidates examined through limited semantic search, none clearly refute the three main contributions. For the feed-forward bilateral grid prediction, one candidate was examined, with no refutation. For the hybrid self-supervised rendering loss using 3D foundation models, seven candidates were examined, all non-refutable or unclear. For the multi-view aware transformer with bilateral confidence grids, three candidates were examined, likewise without clear prior overlap. This suggests that, within the examined scope, the specific combination of techniques appears relatively novel, though the small search scale (eleven candidates in total) means substantial prior work outside this sample remains possible.

The analysis indicates the work occupies a sparse research niche with limited directly comparable prior art among examined candidates. However, the small search scope and the existence of only four papers in the taxonomy leaf suggest this assessment reflects top-K semantic matches rather than exhaustive coverage. The contribution's novelty appears strongest in the feed-forward bilateral grid formulation and integration with 3D foundation models, though broader literature beyond the examined candidates may contain relevant techniques.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 11
Refutable Papers: 0

Research Landscape Overview

Core task: Multi-view appearance harmonization for 3D reconstruction. The field addresses the challenge of ensuring consistent appearance across multiple viewpoints when reconstructing or generating 3D scenes and objects. The taxonomy reveals several major branches: diffusion-based multi-view generation methods that synthesize consistent novel views (e.g., MVDream[1], Vistadream[2]), reconstruction pipelines that explicitly enforce geometric and photometric consistency (e.g., Cross-view Transformer[3], Simvs[26]), neural rendering approaches that leverage implicit representations for view synthesis (e.g., NeRF-based methods), scene editing frameworks that maintain consistency during manipulation (e.g., Gaussctrl[7], View-Consistent Editing[13]), monocular reconstruction techniques that reason about multiple views from single images, and specialized applications targeting domains like human avatars or relighting. These branches reflect a spectrum from generative synthesis to reconstruction-driven consistency, with varying emphases on geometric accuracy versus appearance fidelity.

Particularly active research directions include diffusion-guided generation, where works like Mvdiffusion++[4] and Bag of Views[5] explore different strategies for cross-view attention and noise synchronization, and appearance harmonization within reconstruction pipelines, where photometric consistency becomes critical.

CHROMA[0] sits squarely within the appearance harmonization and photometric consistency cluster, focusing on reconciling lighting and material variations across input views—a challenge distinct from purely geometric multi-view stereo or generative synthesis. Compared to nearby works like Generative Multiview Relighting[33], which tackles relighting as a generative problem, or Specular-to-Diffuse Translation[36], which addresses material decomposition, CHROMA[0] emphasizes harmonizing existing captures rather than synthesizing new conditions. This positions it alongside classical photometric consistency methods while incorporating modern learning-based techniques to handle real-world appearance variability that traditional approaches struggle with.

Claimed Contributions

Feed-forward multi-view appearance harmonization via bilateral grid prediction

The authors introduce a generalizable feed-forward model that predicts spatially adaptive bilateral grids to harmonize photometric variations across multiple views in a consistent manner. This approach processes hundreds of frames in a single step and integrates into downstream 3D reconstruction models without requiring scene-specific retraining.

1 retrieved paper
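To make the bilateral-grid mechanism concrete, the sketch below shows how a predicted grid of per-cell affine color transforms can be sliced and applied to an image. This is a minimal illustration of the general technique, not the authors' implementation: the grid shape, the luminance guide, and the nearest-neighbor slicing are all assumptions (learned models in the HDRNet family predict the grid with a network and slice it trilinearly).

```python
import numpy as np

def slice_bilateral_grid(grid, image):
    """Apply a bilateral grid of per-cell affine color transforms to an image.

    grid  : (D, Gh, Gw, 3, 4) array of affine transforms, indexed by
            (guide-intensity bin, grid row, grid column).
    image : (H, W, 3) float array with values in [0, 1].
    Nearest-neighbor slicing for brevity; learned models typically use
    trilinear interpolation instead.
    """
    D, Gh, Gw, _, _ = grid.shape
    H, W, _ = image.shape
    guide = image.mean(axis=-1)                       # luminance guide in [0, 1]
    zi = np.clip((guide * D).astype(int), 0, D - 1)   # per-pixel intensity bin
    yi = np.broadcast_to(np.arange(H)[:, None] * Gh // H, (H, W))
    xi = np.broadcast_to(np.arange(W)[None, :] * Gw // W, (H, W))
    A = grid[zi, yi, xi]                              # (H, W, 3, 4) per pixel
    hom = np.concatenate([image, np.ones((H, W, 1))], axis=-1)  # homogeneous RGB
    return np.einsum('hwij,hwj->hwi', A, hom)         # apply affine per pixel
```

Because each cell stores a full 3x4 affine transform, a grid whose cells all hold the identity matrix leaves the image unchanged, while spatially varying cells can correct exposure or white-balance shifts locally.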
Hybrid self-supervised rendering loss using 3D foundation models

To overcome the lack of paired training data, the authors develop a hybrid self-supervised rendering loss that leverages 3D foundation models. This training approach improves the model's ability to generalize to real-world appearance variations without requiring paired supervision.

7 retrieved papers
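One plausible shape of such a hybrid objective is sketched below, under the assumption that a self-supervised term compares harmonized frames against pseudo-ground-truth re-renderings from a frozen 3D foundation model, while a supervised term uses synthetically perturbed pairs for which targets are known. The function name, the L1 form, and the weights are hypothetical; the paper's exact loss may differ.

```python
import numpy as np

def hybrid_harmonization_loss(harmonized, rendered_ref, synth_pred, synth_gt,
                              w_render=1.0, w_synth=1.0):
    """Illustrative hybrid loss combining two terms.

    harmonized   : model outputs on real multi-view frames, (N, H, W, 3)
    rendered_ref : re-renderings of the same views from a frozen 3D foundation
                   model, (N, H, W, 3) -- self-supervised rendering term
    synth_pred / synth_gt : predictions and targets on synthetically perturbed
                   pairs where ground truth is known -- supervised term
    """
    render_term = np.abs(harmonized - rendered_ref).mean()  # self-supervised
    synth_term = np.abs(synth_pred - synth_gt).mean()       # paired synthetic
    return w_render * render_term + w_synth * synth_term
```

The design intent of such a mix is that the synthetic term anchors the model where supervision is exact, while the rendering term exposes it to real-world appearance variation that synthetic perturbations do not cover.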
Multi-view aware transformer with bilateral confidence grids

The authors design a multi-view aware transformer architecture that predicts both bilateral grids for appearance transformation and bilateral confidence grids to make the model uncertainty-aware. This enables robust handling of varying appearance conditions across views.

3 retrieved papers
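A confidence grid can make the correction uncertainty-aware by falling back toward the input wherever the predicted transform is unreliable. The sketch below assumes the confidence has already been sliced to a per-pixel map and uses a simple convex blend; both the function name and the blend formulation are assumptions, not the paper's stated mechanism.

```python
import numpy as np

def confidence_blend(transformed, original, confidence):
    """Blend the appearance-corrected image with the input image using a
    per-pixel confidence map in [0, 1]; low confidence keeps the input."""
    c = np.clip(confidence, 0.0, 1.0)[..., None]   # (H, W) -> (H, W, 1)
    return c * transformed + (1.0 - c) * original
```

With confidence 1 everywhere the corrected image passes through untouched; with confidence 0 the model defers entirely to the original capture.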

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
