Light-X: Generative 4D Video Rendering with Camera and Illumination Control

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Controllable Video Generation, Video Relighting, Joint Camera–Illumination Control
Abstract:

Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera–illumination control. In addition, our model surpasses prior video relighting methods in text- and background-conditioned settings. Ablation studies further validate the effectiveness of the disentangled formulation and degradation pipeline. Code, data, and models will be made public.
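To make the conditioning design concrete, the following is a minimal sketch (not the authors' implementation) of the projection step the abstract describes: a per-frame depth map is lifted to a point cloud, re-rendered along a user-defined camera pose to produce the geometry cue, and the same geometry carries the colors of a relit reference frame to produce the lighting cue. The pinhole model, the z-buffer splatting, and all names (unproject, render_condition, T_user) are illustrative assumptions.

```python
# Minimal sketch of point-cloud-based conditioning, assuming a pinhole
# camera and per-frame depth; NOT the paper's actual implementation.
import numpy as np

def unproject(depth, K):
    """Lift a depth map (H, W) into camera-space 3D points (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    return (np.linalg.inv(K) @ rays.T).T * depth.reshape(-1, 1)

def project(points, K, T):
    """Project 3D points into a view with 4x4 world-to-camera pose T."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (T @ pts_h.T).T[:, :3]
    pix = (K @ cam.T).T
    return pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None), cam[:, 2]

def render_condition(points, colors, K, T, H, W):
    """Z-buffer splat: re-render colored points into the target view."""
    pix, z = project(points, K, T)
    img, zbuf = np.zeros((H, W, 3)), np.full((H, W), np.inf)
    u, v = pix[:, 0].astype(int), pix[:, 1].astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (z > 0)
    for ui, vi, zi, ci in zip(u[ok], v[ok], z[ok], colors[ok]):
        if zi < zbuf[vi, ui]:
            zbuf[vi, ui], img[vi, ui] = zi, ci
    return img

# Per frame: the SAME point cloud is rendered twice along the user pose,
# once with original colors (geometry/motion cue) and once with the colors
# of a relit reference frame (illumination cue), disentangling the two.
H, W = 48, 64
K = np.array([[60.0, 0.0, W / 2], [0.0, 60.0, H / 2], [0.0, 0.0, 1.0]])
depth = np.full((H, W), 2.0)
frame = np.random.rand(H, W, 3)         # source video frame
relit = np.random.rand(H, W, 3)         # relit reference frame
points = unproject(depth, K)
T_user = np.eye(4); T_user[0, 3] = 0.1  # user-defined camera pose
geom_cue = render_condition(points, frame.reshape(-1, 3), K, T_user, H, W)
light_cue = render_condition(points, relit.reshape(-1, 3), K, T_user, H, W)
```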

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Light-X, a framework for joint camera trajectory and illumination control in video generation, using a disentangled design that separates geometry (via dynamic point clouds) from lighting (via relit frames). Within the taxonomy, it resides in the 'Disentangled Geometry-Lighting Control' leaf under 'Unified Multi-Modal Control Frameworks'. This leaf contains only two papers: the paper under review and one sibling (Vidcraft3), indicating a sparse and emerging research direction. The taxonomy shows that most prior work addresses camera or lighting control separately, making this joint disentangled approach relatively uncommon.

The taxonomy reveals that neighboring leaves include 'Implicit Joint Control via Unified Conditioning' (two papers) and broader branches like 'Camera Trajectory Control' (nine papers) and 'Illumination and Relighting Control' (seven papers). The scope notes clarify that methods controlling only camera (e.g., CamCo) or only lighting (e.g., Relightable Portrait Animation) belong to specialized single-modality branches. Light-X diverges from these by explicitly separating geometric and photometric signals, whereas implicit joint control methods condition on multiple signals without architectural disentanglement. This positioning suggests the work bridges previously isolated research directions.

Among the 22 candidates examined, the disentangled conditioning scheme (Contribution 2) shows the clearest overlap with prior work: of its 10 reviewed candidates, 2 appear to provide evidence that could refute its novelty. The Light-X framework itself (Contribution 1) and the Light-Syn data pipeline (Contribution 3) were compared against 10 and 2 candidates, respectively, with no clear refutations found. The limited search scope means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The disentangled conditioning appears less novel given the identified overlaps, while the overall framework and data synthesis strategy show fewer direct precedents within the examined set.

Based on the limited literature search (22 candidates), the work appears to occupy a sparsely populated research direction, with only one sibling paper in its taxonomy leaf. The disentangled conditioning scheme has some precedent among the examined candidates, but the integrated framework and data synthesis approach show fewer overlaps. The analysis covers top-K semantic matches and does not claim exhaustive field coverage, so additional related work may exist beyond the examined scope.

Taxonomy

Core-task Taxonomy Papers: 35
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 2

Research Landscape Overview

Core task: joint control of camera trajectory and illumination in video generation. The field has evolved into several distinct branches that reflect different emphases in controllable video synthesis. Unified Multi-Modal Control Frameworks aim to integrate multiple control signals (such as camera motion, lighting conditions, and scene geometry) into a single generation pipeline, often disentangling these factors to enable independent manipulation. Camera Trajectory Control focuses specifically on specifying and executing camera paths, whether through explicit pose sequences or higher-level cinematic directives. Illumination and Relighting Control addresses the challenge of manipulating lighting after capture or during synthesis, drawing on neural rendering and light transport modeling. Motion Control and Trajectory Specification explores how to represent and guide object or scene motion, while 3D Scene Representation and Rendering leverages volumetric or neural scene models to support view synthesis. Cinematic and Optical Control targets film-like aesthetics, and Benchmarks, Taxonomies, and Evaluation Frameworks provide the infrastructure for systematic comparison.

Representative works such as CamCo[3] and Direct-a-Video[5] illustrate camera-centric approaches, while Relightable Portrait Animation[7] and Neural Light Transport[8] exemplify lighting-focused methods. Recent efforts have increasingly sought to unify these control dimensions rather than treating them in isolation. A handful of works, including Uni3C[2] and Gen3c[4], demonstrate multi-modal frameworks that coordinate camera, lighting, and motion signals within a single diffusion or neural rendering backbone.

Light-X[0] sits within this emerging cluster of disentangled geometry-lighting control methods, emphasizing the separation of camera trajectory from illumination changes so that each can be adjusted independently during generation. This contrasts with earlier camera-only approaches like CamCo[3], which prioritize pose control but do not explicitly model lighting variation, and with relighting-focused methods such as Relightable Portrait Animation[7], which handle illumination but typically assume fixed viewpoints. By disentangling these two axes, Light-X[0] and its close neighbor Vidcraft3[1] address a key challenge in achieving flexible, cinematic video synthesis where both viewpoint and lighting can be freely manipulated.

Claimed Contributions

Light-X framework for joint camera and illumination control

The authors introduce Light-X, the first framework that jointly controls camera trajectory and illumination for video generation from monocular input videos. This enables novel-view synthesis with simultaneous lighting manipulation, addressing a gap left by prior methods that handle only one control dimension.

10 retrieved papers
Disentangled conditioning scheme separating geometry and lighting

The authors develop a conditioning formulation that explicitly separates geometric and motion information (via dynamic point clouds) from illumination cues (via projected relit frames). This disentanglement enables fine-grained control and effective learning of both camera trajectory and lighting.

10 retrieved papers (2 flagged as potentially refuting)
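A toy sketch of why this separation matters at the conditioning interface: if the geometry render and the lighting render enter the generator as separate channel groups, either can be swapped without touching the other. The shapes, channel layout, and build_condition are assumptions for illustration, not the paper's actual interface.

```python
# Hypothetical condition-tensor assembly illustrating disentanglement;
# the channel layout is an assumption, not the paper's interface.
import numpy as np

T_frames, H, W = 16, 48, 64
geom_cues = np.random.rand(T_frames, H, W, 3)   # point-cloud renders along the user trajectory
light_cues = np.random.rand(T_frames, H, W, 3)  # relit frame carried through the same geometry

def build_condition(geom, light):
    """Stack per-frame cues channel-wise into a (T, H, W, 6) condition video."""
    return np.concatenate([geom, light], axis=-1)

cond_a = build_condition(geom_cues, light_cues)
# Independent control: reuse the geometry, swap only the lighting cue.
cond_b = build_condition(geom_cues, np.random.rand(T_frames, H, W, 3))
assert np.allclose(cond_a[..., :3], cond_b[..., :3])  # geometry channels unchanged
```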
Light-Syn degradation-based data synthesis pipeline

The authors propose Light-Syn, a data curation pipeline that addresses the scarcity of paired multi-view and multi-illumination videos by synthesizing training pairs from monocular footage through degradation and inverse geometric mapping. This enables robust training across static, dynamic, and AI-generated scenes.

2 retrieved papers
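The report only names the mechanism (degradation plus inverse geometric mapping), so the sketch below is one plausible reading with fully assumed operators: each frame's illumination is degraded with a random color gain and gamma curve, the degraded clip serves as the conditioning input, and the original clip remains the supervision target. The inverse geometric mapping through the recovered point cloud is omitted here.

```python
# Assumed degradation-based pairing in the spirit of Light-Syn; the
# operators and names below are illustrative, not the authors' pipeline.
import numpy as np

rng = np.random.default_rng(0)

def degrade_illumination(frame, rng):
    """Random per-channel gain plus a gamma shift (assumed operators)."""
    gain = rng.uniform(0.5, 1.5, size=3)   # global color tint
    gamma = rng.uniform(0.7, 1.4)          # exposure-like curve
    return np.clip(frame * gain, 0.0, 1.0) ** gamma

def make_training_pair(clip, rng):
    """clip: (T, H, W, 3) in [0, 1]. Returns (degraded input, original target).
    A full pipeline would also inverse-map the degraded frames through the
    estimated dynamic point cloud so the cues align with the target geometry."""
    degraded = np.stack([degrade_illumination(f, rng) for f in clip])
    return degraded, clip

clip = rng.random((8, 48, 64, 3))          # a monocular training clip
cond_input, target = make_training_pair(clip, rng)
```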

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Light-X framework for joint camera and illumination control

Contribution 2: Disentangled conditioning scheme separating geometry and lighting

Contribution 3: Light-Syn degradation-based data synthesis pipeline