Text-to-video generation is significantly more complex than text-to-image generation: the model must capture how objects move and keep frames temporally consistent. Modern video diffusion models such as VDM, Make-A-Video (Meta AI), Imagen Video, and SORA tackle these challenges with strategies such as training on image-text pairs together with unlabelled video, using separate spatial and temporal layers, and running diffusion in a latent space. Large-scale datasets and advances in compute are expected to drive future innovation in this field.

9 min read. From towardsdatascience.com
Table of contents
The Evolution of Text to Video Models
Text to Image Overview
The Temporal Dimension: A New Frontier
The Evolution of Video Diffusion
Video Diffusion Model (VDM) — 2022
Make-A-Video (Meta AI) — 2022
Imagen Video (Google) — 2022
VideoLDM (NVIDIA) — 2023
SORA (OpenAI) — 2024
What’s next
