Light-X: Generative 4D Video Rendering with Camera and Illumination Control

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Controllable Video Generation, Video Relighting, Joint Camera–Illumination Control
Abstract:

Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera–illumination control. In addition, our model surpasses prior video relighting methods in text- and background-conditioned settings. Ablation studies further validate the effectiveness of the disentangled formulation and degradation pipeline. Code, data, and models will be made public.
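To make the conditioning design concrete, the following is a minimal sketch (not the authors' implementation) of the projection step the abstract describes: a per-frame depth map is lifted to a point cloud, re-rendered along a user-defined camera pose to produce the geometry cue, and the same geometry carries the colors of a relit reference frame to produce the lighting cue. The pinhole model, the z-buffer splatting, and all names (unproject, render_condition, T_user) are illustrative assumptions.

```python
# Minimal sketch of point-cloud-based conditioning, assuming a pinhole
# camera and per-frame depth; NOT the paper's actual implementation.
import numpy as np

def unproject(depth, K):
    """Lift a depth map (H, W) into camera-space 3D points (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    return (np.linalg.inv(K) @ rays.T).T * depth.reshape(-1, 1)

def project(points, K, T):
    """Project 3D points into a view with 4x4 world-to-camera pose T."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (T @ pts_h.T).T[:, :3]
    pix = (K @ cam.T).T
    return pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None), cam[:, 2]

def render_condition(points, colors, K, T, H, W):
    """Z-buffer splat: re-render colored points into the target view."""
    pix, z = project(points, K, T)
    img, zbuf = np.zeros((H, W, 3)), np.full((H, W), np.inf)
    u, v = pix[:, 0].astype(int), pix[:, 1].astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (z > 0)
    for ui, vi, zi, ci in zip(u[ok], v[ok], z[ok], colors[ok]):
        if zi < zbuf[vi, ui]:
            zbuf[vi, ui], img[vi, ui] = zi, ci
    return img

# Per frame: the SAME point cloud is rendered twice along the user pose,
# once with original colors (geometry/motion cue) and once with the colors
# of a relit reference frame (illumination cue), disentangling the two.
H, W = 48, 64
K = np.array([[60.0, 0.0, W / 2], [0.0, 60.0, H / 2], [0.0, 0.0, 1.0]])
depth = np.full((H, W), 2.0)
frame = np.random.rand(H, W, 3)         # source video frame
relit = np.random.rand(H, W, 3)         # relit reference frame
points = unproject(depth, K)
T_user = np.eye(4); T_user[0, 3] = 0.1  # user-defined camera pose
geom_cue = render_condition(points, frame.reshape(-1, 3), K, T_user, H, W)
light_cue = render_condition(points, relit.reshape(-1, 3), K, T_user, H, W)
```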

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Light-X, a framework for joint camera trajectory and illumination control in video generation, using a disentangled design that separates geometry (via dynamic point clouds) from lighting (via relit frames). Within the taxonomy, it resides in the 'Disentangled Geometry-Lighting Control' leaf under 'Unified Multi-Modal Control Frameworks'. This leaf contains only two papers: the paper under review and one sibling (Vidcraft3), indicating a sparse and emerging research direction. The taxonomy shows that most prior work addresses camera or lighting control separately, making this joint disentangled approach relatively uncommon.

The taxonomy reveals that neighboring leaves include 'Implicit Joint Control via Unified Conditioning' (two papers) and broader branches like 'Camera Trajectory Control' (nine papers) and 'Illumination and Relighting Control' (seven papers). The scope notes clarify that methods controlling only camera (e.g., CamCo) or only lighting (e.g., Relightable Portrait Animation) belong to specialized single-modality branches. Light-X diverges from these by explicitly separating geometric and photometric signals, whereas implicit joint control methods condition on multiple signals without architectural disentanglement. This positioning suggests the work bridges previously isolated research directions.

Among the 22 candidates examined, the disentangled conditioning scheme (Contribution 2) shows the clearest overlap with prior work: of its 10 reviewed candidates, 2 appear to provide evidence that could refute its novelty. The Light-X framework itself (Contribution 1) and the Light-Syn data pipeline (Contribution 3) were compared against 10 and 2 candidates, respectively, with no clear refutations found. The limited search scope means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The disentangled conditioning appears less novel given the identified overlaps, while the overall framework and data synthesis strategy show fewer direct precedents within the examined set.

Based on the limited literature search (22 candidates), the work appears to occupy a sparsely populated research direction, with only one sibling paper in its taxonomy leaf. The disentangled conditioning scheme has some precedent among the examined candidates, but the integrated framework and data synthesis approach show fewer overlaps. The analysis covers top-K semantic matches and does not claim exhaustive field coverage, so additional related work may exist beyond the examined scope.

Taxonomy

Core-task Taxonomy Papers: 35
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 2

Research Landscape Overview

Core task: joint control of camera trajectory and illumination in video generation. The field has evolved into several distinct branches that reflect different emphases in controllable video synthesis. Unified Multi-Modal Control Frameworks aim to integrate multiple control signals (such as camera motion, lighting conditions, and scene geometry) into a single generation pipeline, often disentangling these factors to enable independent manipulation. Camera Trajectory Control focuses specifically on specifying and executing camera paths, whether through explicit pose sequences or higher-level cinematic directives. Illumination and Relighting Control addresses the challenge of manipulating lighting after capture or during synthesis, drawing on neural rendering and light transport modeling. Motion Control and Trajectory Specification explores how to represent and guide object or scene motion, while 3D Scene Representation and Rendering leverages volumetric or neural scene models to support view synthesis. Cinematic and Optical Control targets film-like aesthetics, and Benchmarks, Taxonomies, and Evaluation Frameworks provide the infrastructure for systematic comparison.

Representative works such as CamCo[3] and Direct-a-Video[5] illustrate camera-centric approaches, while Relightable Portrait Animation[7] and Neural Light Transport[8] exemplify lighting-focused methods. Recent efforts have increasingly sought to unify these control dimensions rather than treating them in isolation. A handful of works, including Uni3C[2] and Gen3c[4], demonstrate multi-modal frameworks that coordinate camera, lighting, and motion signals within a single diffusion or neural rendering backbone.

Light-X[0] sits within this emerging cluster of disentangled geometry-lighting control methods, emphasizing the separation of camera trajectory from illumination changes so that each can be adjusted independently during generation. This contrasts with earlier camera-only approaches like CamCo[3], which prioritize pose control but do not explicitly model lighting variation, and with relighting-focused methods such as Relightable Portrait Animation[7], which handle illumination but typically assume fixed viewpoints. By disentangling these two axes, Light-X[0] and its close neighbor Vidcraft3[1] address a key challenge in achieving flexible, cinematic video synthesis where both viewpoint and lighting can be freely manipulated.

Claimed Contributions

Light-X framework for joint camera and illumination control

The authors introduce Light-X, the first framework that jointly controls camera trajectory and illumination for video generation from monocular input videos. This enables novel-view synthesis with simultaneous lighting manipulation, addressing a gap left by prior methods that handle only one control dimension.

10 retrieved papers
Disentangled conditioning scheme separating geometry and lighting

The authors develop a conditioning formulation that explicitly separates geometric and motion information (via dynamic point clouds) from illumination cues (via projected relit frames). This disentanglement enables fine-grained control and effective learning of both camera trajectory and lighting.

10 retrieved papers (2 flagged as potentially refuting)
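A toy sketch of why this separation matters at the conditioning interface: if the geometry render and the lighting render enter the generator as separate channel groups, either can be swapped without touching the other. The shapes, channel layout, and build_condition are assumptions for illustration, not the paper's actual interface.

```python
# Hypothetical condition-tensor assembly illustrating disentanglement;
# the channel layout is an assumption, not the paper's interface.
import numpy as np

T_frames, H, W = 16, 48, 64
geom_cues = np.random.rand(T_frames, H, W, 3)   # point-cloud renders along the user trajectory
light_cues = np.random.rand(T_frames, H, W, 3)  # relit frame carried through the same geometry

def build_condition(geom, light):
    """Stack per-frame cues channel-wise into a (T, H, W, 6) condition video."""
    return np.concatenate([geom, light], axis=-1)

cond_a = build_condition(geom_cues, light_cues)
# Independent control: reuse the geometry, swap only the lighting cue.
cond_b = build_condition(geom_cues, np.random.rand(T_frames, H, W, 3))
assert np.allclose(cond_a[..., :3], cond_b[..., :3])  # geometry channels unchanged
```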
Light-Syn degradation-based data synthesis pipeline

The authors propose Light-Syn, a data curation pipeline that addresses the scarcity of paired multi-view and multi-illumination videos by synthesizing training pairs from monocular footage through degradation and inverse geometric mapping. This enables robust training across static, dynamic, and AI-generated scenes.

2 retrieved papers
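The report only names the mechanism (degradation plus inverse geometric mapping), so the sketch below is one plausible reading with fully assumed operators: each frame's illumination is degraded with a random color gain and gamma curve, the degraded clip serves as the conditioning input, and the original clip remains the supervision target. The inverse geometric mapping through the recovered point cloud is omitted here.

```python
# Assumed degradation-based pairing in the spirit of Light-Syn; the
# operators and names below are illustrative, not the authors' pipeline.
import numpy as np

rng = np.random.default_rng(0)

def degrade_illumination(frame, rng):
    """Random per-channel gain plus a gamma shift (assumed operators)."""
    gain = rng.uniform(0.5, 1.5, size=3)   # global color tint
    gamma = rng.uniform(0.7, 1.4)          # exposure-like curve
    return np.clip(frame * gain, 0.0, 1.0) ** gamma

def make_training_pair(clip, rng):
    """clip: (T, H, W, 3) in [0, 1]. Returns (degraded input, original target).
    A full pipeline would also inverse-map the degraded frames through the
    estimated dynamic point cloud so the cues align with the target geometry."""
    degraded = np.stack([degrade_illumination(f, rng) for f in clip])
    return degraded, clip

clip = rng.random((8, 48, 64, 3))          # a monocular training clip
cond_input, target = make_training_pair(clip, rng)
```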

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Light-X framework for joint camera and illumination control

Contribution 2: Disentangled conditioning scheme separating geometry and lighting

Contribution 3: Light-Syn degradation-based data synthesis pipeline