Ever wondered what actually happens between typing a prompt and getting a video? In this episode of Release Notes Explained, we break down the complex architecture of state-of-the-art AI video models.

Subscribe to Google for Developers → https://goo.gle/developers  

Speaker: Nikita Namjoshi 
Products Mentioned: Google AI

Google for Developers

A conceptual walkthrough of how AI video generation works, starting from the basics of diffusion models. It covers forward and reverse diffusion for images, then extends the idea to video by treating a clip as a set of 3D space-time patches processed by a vision transformer. It explains two key challenges: temporal consistency (addressed with attention over space-time patches) and computational cost (addressed with latent diffusion, which uses an autoencoder to compress frames before diffusion runs). The full pipeline, in summary: encode frames into latent space, apply iterative denoising with a transformer diffusion model conditioned on the text prompt, then decode back to pixels.
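
To make two of the ideas above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the model described in the episode: `forward_diffuse` assumes the common DDPM-style closed form for adding noise (the episode does not name a specific noise schedule), and `spacetime_patches` assumes non-overlapping patches over a video stored as a nested list indexed `[time][height][width]`. The function names, the patch sizes, and the `alpha_bar` parameter are all hypothetical choices for this example.

```python
import math
import random


def forward_diffuse(x0, alpha_bar, rng):
    """Forward diffusion step (assumed DDPM-style closed form).

    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise
    alpha_bar near 1.0 keeps the signal; near 0.0 leaves almost pure noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    keep = math.sqrt(alpha_bar)
    add = math.sqrt(1.0 - alpha_bar)
    x_t = [keep * v + add * n for v, n in zip(x0, noise)]
    return x_t, noise


def spacetime_patches(video, pt, ph, pw):
    """Split video[T][H][W] into flattened, non-overlapping 3D patches.

    Each patch spans pt frames, ph rows, and pw columns, so a single token
    carries information about both space and time.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("patch size must divide the video dimensions")
    patches = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                patches.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return patches


# Toy "latent video": 4 frames of 4x4 values, each value encoding its position.
video = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)]
         for t in range(4)]
tokens = spacetime_patches(video, 2, 2, 2)   # 8 patches, 8 values each

# Noise the first patch halfway; a real model would be trained to predict
# `noise` from `noisy` (plus the text conditioning) and undo it step by step.
noisy, noise = forward_diffuse(tokens[0], 0.5, random.Random(0))
```

In a latent diffusion setup as summarized above, the patching and noising would operate on the autoencoder's compressed representation rather than on raw pixels, which is what keeps the computation manageable; the toy integers here stand in for those latent values.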

How Does Video Generation Work?