Step into the World from a Single Image

This research presents a diffusion-based approach for synthesizing a consistent video of a 3D scene from a single input image. The method conditions the noise-prediction network on the input image and on the relative camera poses through cross-view attention. To improve long-range consistency, epipolar constraints are built into the attention module (epipolar attention), so that each location attends only to candidate correspondences along its epipolar line in the other view. The approach is reported to outperform prior methods in both per-frame quality and temporal consistency.
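
The epipolar restriction described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera with intrinsics K shared by both views, a relative pose (R, t) that maps points from the query view to the key view, and a hard pixel-distance threshold max_dist. All function names and parameters here are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])

def fundamental_matrix(K, R, t):
    """F such that F @ x_query is the epipolar line in the key view.

    Assumes X_key = R @ X_query + t and the same intrinsics K for both views.
    """
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def pixel_grid(h, w):
    """Homogeneous pixel-centre coordinates, shape (h*w, 3), row-major order."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs.ravel() + 0.5, ys.ravel() + 0.5, np.ones(h * w)], axis=1)

def epipolar_mask(K, R, t, h, w, max_dist=1.0):
    """Boolean mask of shape (h*w, h*w).

    Entry [i, j] is True when key pixel j lies within max_dist pixels of the
    epipolar line induced by query pixel i.
    """
    pts = pixel_grid(h, w)
    lines = pts @ fundamental_matrix(K, R, t).T        # row i is F @ x_i
    norm = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    dist = np.abs(lines @ pts.T) / np.maximum(norm, 1e-12)
    return dist <= max_dist

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to positions where mask is True.

    Masked scores are set to a large negative number rather than -inf, so a
    query whose row is fully masked falls back to uniform attention.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v
```

For a purely sideways camera translation with no rotation, the epipolar lines are horizontal, so the mask lets each query attend only to keys in its own image row. In practice the mask would be computed at the resolution of the attention feature map, with K rescaled accordingly.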

•4m watch time
