Step into the World from a Single Image
A research approach for synthesizing 3D-consistent videos of a scene from a single input image using diffusion models. The method conditions a noise-prediction network on the input image and on the relative camera poses via cross-view attention. To improve long-range consistency, an epipolar constraint is built into the attention module (epipolar attention), so that each query pixel attends only to candidate correspondences along its epipolar line in the other view. The approach is reported to outperform prior methods in both per-frame quality and temporal consistency.
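To make the epipolar constraint concrete, here is a minimal sketch of how an attention mask could be derived from a relative camera pose. It is an illustration under stated assumptions, not the authors' implementation: the function name `epipolar_mask`, the `max_dist` pixel threshold, a shared pinhole camera (focal length `f`, principal point `cx`, `cy`), and the pose convention `X_target = R @ X_source + t` are all hypothetical choices made for this example.

```python
import math

def cross_matrix(t):
    """Skew-symmetric matrix [t]_x such that [t]_x @ v == cross(t, v)."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def epipolar_mask(query_uv, R, t, f, cx, cy, height, width, max_dist=1.0):
    """Boolean mask over source-view pixels near the epipolar line of a
    target-view query pixel.

    Assumed convention (not from the source): X_target = R @ X_source + t,
    and both views share the pinhole intrinsics (f, cx, cy).
    """
    u_q, v_q = query_uv
    # Query pixel in normalized camera coordinates of the target view.
    x_q = [(u_q - cx) / f, (v_q - cy) / f, 1.0]
    # Essential matrix E = [t]_x R, satisfying x_target^T E x_source = 0.
    E = matmul3(cross_matrix(t), R)
    # Epipolar line in normalized source coordinates: E^T x_q.
    a, b, c = (sum(E[k][i] * x_q[k] for k in range(3)) for i in range(3))
    # Convert the line to pixel coordinates (multiply by K^-T).
    l0, l1 = a / f, b / f
    l2 = c - (a * cx + b * cy) / f
    norm = math.hypot(l0, l1)
    if norm < 1e-12:
        # Degenerate line (e.g. zero translation): apply no constraint.
        return [[True] * width for _ in range(height)]
    # Keep pixels whose distance to the line is at most max_dist pixels.
    return [[abs(l0 * u + l1 * v + l2) / norm <= max_dist
             for u in range(width)]
            for v in range(height)]

# Pure sideways translation: the epipolar line is the query's own image row.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask = epipolar_mask((2, 1), identity, (1.0, 0.0, 0.0),
                     f=100.0, cx=2.0, cy=2.0, height=4, width=4, max_dist=0.5)
```

In an attention layer, a mask like this would typically be applied by setting the attention logits of masked-out key positions to negative infinity before the softmax, so that each query only aggregates features from near its epipolar line. A soft weighting by distance would be an equally plausible design; the source does not say which variant is used.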
• 4m watch time