Consistent View Synthesis with Pose-Guided Diffusion Models

Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

📝 Paper: https://arxiv.org/abs/2303.17598
🌐 Website: https://poseguided-diffusion.github.io/

📄 Abstract: Novel view synthesis from a single image has been a cornerstone problem for many Virtual Reality applications that provide immersive experiences. However, most existing techniques can only synthesize novel views within a limited range of camera motion or fail to generate consistent and high-quality novel views under significant camera movement. In this work, we propose a pose-guided diffusion model to generate a consistent long-term video of novel views from a single image. We design an attention layer that uses epipolar lines as constraints to facilitate the association between different viewpoints. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of the proposed diffusion model against state-of-the-art transformer-based and GAN-based approaches.


🎵 Music: Dreams by Benjamin Tissot
Bensound.com/royalty-free-music
License code: DXH89GRKD9BRVF2Z

This work presents a research approach for synthesizing consistent videos of a 3D scene from a single input image using diffusion models. The method conditions a noise-prediction network on the input image and the relative camera pose via cross-view attention. To improve long-range consistency, epipolar constraints are incorporated into the attention module (epipolar attention), restricting candidate correspondences between the two views to epipolar lines. In the reported experiments, the approach outperforms prior transformer-based and GAN-based methods in per-frame quality and temporal consistency.
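
To make the epipolar constraint concrete, here is a minimal NumPy sketch of how an attention mask could be derived from a relative camera pose. It is an illustration under stated assumptions, not the authors' implementation: the function names (`fundamental_matrix`, `epipolar_mask`, `masked_attention_weights`), the hard distance threshold `max_dist`, the shared pinhole intrinsics `K`, and the pose convention `X_target = R @ X_source + t` are all hypothetical choices made for this example.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])

def fundamental_matrix(K, R, t):
    """F such that x_target^T F x_source = 0 for matching homogeneous pixels.

    Assumes both views share intrinsics K and X_target = R @ X_source + t.
    """
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def epipolar_mask(K, R, t, target_px, height, width, max_dist=0.5):
    """Boolean (height, width) mask of source pixels near the epipolar line
    of one target pixel. The hard threshold is a simplification chosen here."""
    F = fundamental_matrix(K, R, t)
    x_t = np.array([target_px[0], target_px[1], 1.0])
    a, b, c = F.T @ x_t                      # line a*x + b*y + c = 0 in the source view
    ys, xs = np.mgrid[0:height, 0:width]     # integer pixel coordinates
    # Point-to-line distance; the epsilon guards the degenerate t = 0 case.
    dist = np.abs(a * xs + b * ys + c) / (np.hypot(a, b) + 1e-12)
    return dist <= max_dist

def masked_attention_weights(logits, mask):
    """Softmax over source pixels, restricted to the masked positions.

    Assumes the mask has at least one True entry.
    """
    masked = np.where(mask, logits, -np.inf)
    weights = np.exp(masked - masked.max())
    return weights / weights.sum()

# Example: a pure sideways translation gives a horizontal epipolar line.
K = np.array([[8.0, 0.0, 4.0], [0.0, 8.0, 4.0], [0.0, 0.0, 1.0]])
mask = epipolar_mask(K, np.eye(3), np.array([0.5, 0.0, 0.0]), (8.0, 3.0), 8, 8)
weights = masked_attention_weights(np.zeros((8, 8)), mask)
```

In this toy setup the mask selects a single image row, so a query pixel in the target view can only attend to source pixels along that row. A soft weighting based on the same distance would be an alternative to the hard threshold used here.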

Step into the World from a Single Image