¹S-Lab, NTU   ²BAAI   ³HUST   ⁴AIR, THU   ⁵FNii, CUHKSZ   ⁶SJTU   ⁷EIT (Ningbo)
Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view, multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control. Moreover, our model surpasses prior video relighting methods in both text- and background-conditioned settings. Ablation studies further validate the effectiveness of the disentangled formulation and the degradation pipeline.
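To make the degradation idea concrete, the sketch below shows one way such training pairs could be synthesized: a clip-level lighting perturbation degrades an in-the-wild clip, and the pair is then used in the inverse direction, with the degraded clip as model input and the original footage as the relighting target. The function names and the specific color-gain/gamma degradation are illustrative assumptions, not the exact Light-Syn recipe.

# Minimal, hypothetical sketch of degradation-based pair synthesis in the spirit
# of Light-Syn. The specific degradation (a per-channel gain plus a gamma shift)
# and the pairing direction are illustrative assumptions, not the paper's recipe.
import numpy as np

def degrade_illumination(video: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a clip-level synthetic lighting change to a float video in [0, 1].

    video: array of shape (T, H, W, 3).
    """
    gain = rng.uniform(0.5, 1.5, size=3)   # per-channel color/intensity shift
    gamma = rng.uniform(0.7, 1.4)          # global tone change
    degraded = np.clip(video * gain, 0.0, 1.0) ** gamma
    return degraded.astype(video.dtype)

def make_training_pair(video: np.ndarray, rng: np.random.Generator):
    """Inverse-mapping pairing: the degraded clip becomes the model input,
    while the original in-the-wild clip serves as the relighting target."""
    return degrade_illumination(video, rng), video

# Usage on a dummy clip (T=8 frames of 64x64 RGB).
rng = np.random.default_rng(0)
clip = rng.random((8, 64, 64, 3), dtype=np.float32)
source, target = make_training_pair(clip, rng)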
Overview of Light-X. Given an input video \( \mathbf{V}^s \), we first relight one frame with IC-Light, conditioned on a lighting text prompt, to obtain a sparse relit video \( \hat{\mathbf{V}}^s \). We then estimate depths to construct a dynamic point cloud \( \mathcal{P} \) from \( \mathbf{V}^s \) and a relit point cloud \( \hat{\mathcal{P}} \) from \( \hat{\mathbf{V}}^s \). Both point clouds are projected along a user-specified camera trajectory, producing geometry-aligned renders and masks \( (\mathbf{V}^p, \mathbf{V}^m) \) and \( (\hat{\mathbf{V}}^p, \hat{\mathbf{V}}^m) \). These six cues, together with illumination tokens extracted via a Q-Former, are fed into DiT blocks for conditional denoising. Finally, a VAE decoder reconstructs a high-fidelity video \( \mathbf{V}^t \) faithful to the target trajectory and illumination.
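To illustrate the geometry cues above, the following minimal sketch (assuming a simple pinhole camera and a single RGB-D frame) unprojects a frame into a point cloud and re-projects it under a new camera pose, producing a projected render and visibility mask analogous to \( (\mathbf{V}^p, \mathbf{V}^m) \). The intrinsics, the nearest-pixel splatting, and all function names are illustrative assumptions rather than the released implementation, which operates on dynamic point clouds over whole videos.

# Minimal sketch of the geometry cue: unproject one RGB-D frame into a point
# cloud and splat it into a new camera pose, yielding a projected render and a
# visibility mask. Single-frame scope and naive splatting are assumptions.
import numpy as np

def unproject(rgb: np.ndarray, depth: np.ndarray, K: np.ndarray):
    """Lift an RGB-D frame (H, W, 3) / (H, W) into 3D points and colors."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    z = depth.reshape(-1)
    pix = np.stack([u.reshape(-1) * z, v.reshape(-1) * z, z], axis=0)  # (3, N)
    pts = np.linalg.inv(K) @ pix                                       # camera-space points
    return pts.T, rgb.reshape(-1, 3)

def project(pts: np.ndarray, colors: np.ndarray, K: np.ndarray,
            R: np.ndarray, t: np.ndarray, H: int, W: int):
    """Render the point cloud from a new pose (R, t); return image and mask."""
    cam = pts @ R.T + t                  # source frame -> target camera frame
    z = cam[:, 2]
    valid = z > 1e-6                     # keep points in front of the camera
    uvw = K @ cam[valid].T
    u = np.round(uvw[0] / uvw[2]).astype(int)
    v = np.round(uvw[1] / uvw[2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    render = np.zeros((H, W, 3), dtype=colors.dtype)
    mask = np.zeros((H, W), dtype=bool)
    # Naive z-buffer: draw far points first so nearer points overwrite them.
    order = np.argsort(-z[valid][inside])
    uu, vv, cc = u[inside][order], v[inside][order], colors[valid][inside][order]
    render[vv, uu] = cc
    mask[vv, uu] = True
    return render, mask

# Toy usage: identity rotation with a small lateral camera shift.
H, W = 64, 64
K = np.array([[60.0, 0, W / 2], [0, 60.0, H / 2], [0, 0, 1]])
rgb = np.random.rand(H, W, 3).astype(np.float32)
depth = np.full((H, W), 2.0)
pts, cols = unproject(rgb, depth, K)
V_p, V_m = project(pts, cols, K, np.eye(3), np.array([0.1, 0.0, 0.0]), H, W)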
@article{lightx,
title = {Light-X: Generative 4D Video Rendering with Camera and Illumination Control},
author = {Liu, Tianqi and Chen, Zhaoxi and Huang, Zihao and Xu, Shaocong and Zhang, Saining and Ye, Chongjie and Li, Bohan and Cao, Zhiguo and Li, Wei and Zhao, Hao and Liu, Ziwei},
journal = {arXiv preprint arXiv:2512.05115},
year = {2025}
}