CHROMA: Consistent Harmonization of Multi-View Appearance via Bilateral Grid Prediction

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Bilateral Grid, Appearance Harmonization, 3D Reconstruction
Abstract:

Modern camera pipelines apply extensive on-device processing, such as exposure adjustment, white balance, and color correction, which, while beneficial for individual frames, often introduces photometric inconsistencies across views. These appearance variations violate multi-view consistency and degrade novel view synthesis. Joint optimization of scene-specific representations and per-image appearance embeddings has been proposed to address this issue, but at the cost of increased computational complexity and slower training. In this work, we propose a generalizable, feed-forward approach that predicts spatially adaptive bilateral grids to correct photometric variations in a multi-view consistent manner. Our model processes hundreds of frames in a single step, enabling efficient large-scale harmonization, and integrates seamlessly into downstream 3D reconstruction models, providing cross-scene generalization without scene-specific retraining. To overcome the lack of paired data, we employ a hybrid self-supervised rendering loss that leverages 3D foundation models, improving generalization to real-world variations. Extensive experiments show that our approach matches or outperforms the reconstruction quality of existing scene-specific optimization methods with appearance modeling, without significantly affecting the training time of baseline 3D models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a feed-forward approach for multi-view appearance harmonization using spatially adaptive bilateral grids to correct photometric inconsistencies introduced by camera processing pipelines. It resides in the 'Appearance Harmonization and Photometric Consistency' leaf, which contains only four papers total, including this work. This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that photometric harmonization for 3D reconstruction remains an underexplored area compared to more crowded branches like diffusion-based multi-view generation or geometric consistency enforcement.

The taxonomy reveals that neighboring research directions focus on geometric consistency via cross-view constraints (six papers) and neural rendering with multi-view consistency (six papers across NeRF and Gaussian splatting). The sibling papers in the same leaf address related but distinct aspects: generative multiview relighting, specular-to-diffuse translation, and other photometric challenges. The scope note explicitly excludes purely geometric methods, positioning this work at the intersection of appearance modeling and reconstruction rather than generative synthesis or traditional multi-view stereo, which occupy separate branches with substantially more papers.

Among the eleven candidates examined through limited semantic search, none clearly refute the three main contributions. For the feed-forward bilateral grid prediction, one candidate was examined, with no refutation. For the hybrid self-supervised rendering loss using 3D foundation models, seven candidates were examined, all non-refutable or unclear. For the multi-view aware transformer with bilateral confidence grids, three candidates were examined, likewise without clear prior overlap. This suggests that, within the examined scope, the specific combination of techniques appears relatively novel, though the small search scale (eleven candidates in total) means substantial prior work outside this sample remains possible.

The analysis indicates the work occupies a sparse research niche with limited directly comparable prior art among examined candidates. However, the small search scope and the existence of only four papers in the taxonomy leaf suggest this assessment reflects top-K semantic matches rather than exhaustive coverage. The contribution's novelty appears strongest in the feed-forward bilateral grid formulation and integration with 3D foundation models, though broader literature beyond the examined candidates may contain relevant techniques.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 11
Refutable Papers: 0

Research Landscape Overview

Core task: Multi-view appearance harmonization for 3D reconstruction. The field addresses the challenge of ensuring consistent appearance across multiple viewpoints when reconstructing or generating 3D scenes and objects. The taxonomy reveals several major branches: diffusion-based multi-view generation methods that synthesize consistent novel views (e.g., MVDream[1], Vistadream[2]), reconstruction pipelines that explicitly enforce geometric and photometric consistency (e.g., Cross-view Transformer[3], Simvs[26]), neural rendering approaches that leverage implicit representations for view synthesis (e.g., NeRF-based methods), scene editing frameworks that maintain consistency during manipulation (e.g., Gaussctrl[7], View-Consistent Editing[13]), monocular reconstruction techniques that reason about multiple views from single images, and specialized applications targeting domains like human avatars or relighting. These branches reflect a spectrum from generative synthesis to reconstruction-driven consistency, with varying emphases on geometric accuracy versus appearance fidelity.

Particularly active research directions include diffusion-guided generation, where works like Mvdiffusion++[4] and Bag of Views[5] explore different strategies for cross-view attention and noise synchronization, and appearance harmonization within reconstruction pipelines, where photometric consistency becomes critical.

CHROMA[0] sits squarely within the appearance harmonization and photometric consistency cluster, focusing on reconciling lighting and material variations across input views—a challenge distinct from purely geometric multi-view stereo or generative synthesis. Compared to nearby works like Generative Multiview Relighting[33], which tackles relighting as a generative problem, or Specular-to-Diffuse Translation[36], which addresses material decomposition, CHROMA[0] emphasizes harmonizing existing captures rather than synthesizing new conditions. This positions it alongside classical photometric consistency methods while incorporating modern learning-based techniques to handle real-world appearance variability that traditional approaches struggle with.

Claimed Contributions

Feed-forward multi-view appearance harmonization via bilateral grid prediction

The authors introduce a generalizable feed-forward model that predicts spatially adaptive bilateral grids to harmonize photometric variations across multiple views in a consistent manner. This approach processes hundreds of frames in a single step and integrates into downstream 3D reconstruction models without requiring scene-specific retraining.

1 retrieved paper
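To make the bilateral-grid mechanism concrete, the sketch below shows how a predicted grid of per-cell affine color transforms can be sliced and applied to an image. This is a minimal illustration of the general technique, not the authors' implementation: the grid shape, the luminance guide, and the nearest-neighbor slicing are all assumptions (learned models in the HDRNet family predict the grid with a network and slice it trilinearly).

```python
import numpy as np

def slice_bilateral_grid(grid, image):
    """Apply a bilateral grid of per-cell affine color transforms to an image.

    grid  : (D, Gh, Gw, 3, 4) array of affine transforms, indexed by
            (guide-intensity bin, grid row, grid column).
    image : (H, W, 3) float array with values in [0, 1].
    Nearest-neighbor slicing for brevity; learned models typically use
    trilinear interpolation instead.
    """
    D, Gh, Gw, _, _ = grid.shape
    H, W, _ = image.shape
    guide = image.mean(axis=-1)                       # luminance guide in [0, 1]
    zi = np.clip((guide * D).astype(int), 0, D - 1)   # per-pixel intensity bin
    yi = np.broadcast_to(np.arange(H)[:, None] * Gh // H, (H, W))
    xi = np.broadcast_to(np.arange(W)[None, :] * Gw // W, (H, W))
    A = grid[zi, yi, xi]                              # (H, W, 3, 4) per pixel
    hom = np.concatenate([image, np.ones((H, W, 1))], axis=-1)  # homogeneous RGB
    return np.einsum('hwij,hwj->hwi', A, hom)         # apply affine per pixel
```

Because each cell stores a full 3x4 affine transform, a grid whose cells all hold the identity matrix leaves the image unchanged, while spatially varying cells can correct exposure or white-balance shifts locally.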
Hybrid self-supervised rendering loss using 3D foundation models

To overcome the lack of paired training data, the authors develop a hybrid self-supervised rendering loss that leverages 3D foundation models. This training approach improves the model's ability to generalize to real-world appearance variations without requiring paired supervision.

7 retrieved papers
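One plausible shape of such a hybrid objective is sketched below, under the assumption that a self-supervised term compares harmonized frames against pseudo-ground-truth re-renderings from a frozen 3D foundation model, while a supervised term uses synthetically perturbed pairs for which targets are known. The function name, the L1 form, and the weights are hypothetical; the paper's exact loss may differ.

```python
import numpy as np

def hybrid_harmonization_loss(harmonized, rendered_ref, synth_pred, synth_gt,
                              w_render=1.0, w_synth=1.0):
    """Illustrative hybrid loss combining two terms.

    harmonized   : model outputs on real multi-view frames, (N, H, W, 3)
    rendered_ref : re-renderings of the same views from a frozen 3D foundation
                   model, (N, H, W, 3) -- self-supervised rendering term
    synth_pred / synth_gt : predictions and targets on synthetically perturbed
                   pairs where ground truth is known -- supervised term
    """
    render_term = np.abs(harmonized - rendered_ref).mean()  # self-supervised
    synth_term = np.abs(synth_pred - synth_gt).mean()       # paired synthetic
    return w_render * render_term + w_synth * synth_term
```

The design intent of such a mix is that the synthetic term anchors the model where supervision is exact, while the rendering term exposes it to real-world appearance variation that synthetic perturbations do not cover.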
Multi-view aware transformer with bilateral confidence grids

The authors design a multi-view aware transformer architecture that predicts both bilateral grids for appearance transformation and bilateral confidence grids to make the model uncertainty-aware. This enables robust handling of varying appearance conditions across views.

3 retrieved papers
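A confidence grid can make the correction uncertainty-aware by falling back toward the input wherever the predicted transform is unreliable. The sketch below assumes the confidence has already been sliced to a per-pixel map and uses a simple convex blend; both the function name and the blend formulation are assumptions, not the paper's stated mechanism.

```python
import numpy as np

def confidence_blend(transformed, original, confidence):
    """Blend the appearance-corrected image with the input image using a
    per-pixel confidence map in [0, 1]; low confidence keeps the input."""
    c = np.clip(confidence, 0.0, 1.0)[..., None]   # (H, W) -> (H, W, 1)
    return c * transformed + (1.0 - c) * original
```

With confidence 1 everywhere the corrected image passes through untouched; with confidence 0 the model defers entirely to the original capture.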

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
