We’ve witnessed remarkable strides in AI image generation. But what happens when we add the dimension of time? Videos are moving images, after all. Text-to-video generation is a complex task that…

Towards Data Science is a community-powered publication that showcases work in data science, machine learning and artificial intelligence. Every day newcomers, seasoned researchers and industry practitioners publish tutorials, research notes and real-world case studies that help the field move forward.

Towards Data Science

Text-to-video generation is significantly more complex than text-to-image, demanding understanding of object movement and temporal consistency. Modern video diffusion models, like VDM, Make-A-Video by Meta AI, Imagen Video, and SORA, tackle these challenges using strategies such as combining image-text and unlabelled video data, spatial and temporal layers, and latent diffusion. Large-scale datasets and computational advancements are expected to drive future innovations in this field.

The Evolution of Text to Video Models