Text-to-video generation is significantly more complex than text-to-image generation, because a model must also capture object motion and keep frames temporally consistent. Modern video diffusion models, such as VDM, Make-A-Video (Meta AI), Imagen Video, and SORA, tackle these challenges with strategies such as training on both image-text data and unlabelled video, using spatial and temporal layers, and applying latent diffusion. Large-scale datasets and computational advances are expected to drive future innovation in this field.
Table of contents
The Evolution of Text to Video Models
Text to Image Overview
The Temporal Dimension: A New Frontier
The Evolution of Video Diffusion
Video Diffusion Model (VDM) — 2022
Make-A-Video (Meta AI) — 2022
Imagen Video (Google) — 2022
VideoLDM (NVIDIA) — 2023
SORA (OpenAI) — 2024
What’s next