A conceptual walkthrough of how AI video generation works, starting from the basics of diffusion models. It covers forward and reverse diffusion for images, then extends the idea to video by splitting clips into 3D space-time patches that a vision transformer processes. It explains two key challenges: temporal consistency (addressed via attention over space-time patches) and computational cost (addressed via latent diffusion, where an autoencoder compresses frames before diffusion). The full pipeline is summarized: encode frames into latent space, apply iterative denoising with a text-conditioned transformer diffusion model, then decode back to pixels.
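
To make two of the ideas in the summary concrete, here is a minimal NumPy sketch of the forward (noising) step and of cutting a clip into space-time patches. It is an illustration under stated assumptions, not the method used in the video: the function names `forward_diffuse` and `patchify`, the clip shape, the patch sizes, and the noise level `alpha_bar` are all hypothetical, and the random array stands in for latents that a real autoencoder would produce. The denoising transformer, text conditioning, and decoder are omitted.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar, rng):
    """Mix a clean sample x0 with Gaussian noise.

    alpha_bar in (0, 1] is the fraction of signal variance kept:
    1.0 leaves x0 untouched, values near 0 give almost pure noise.
    """
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise

def patchify(video, pt, ph, pw):
    """Split a (T, H, W, C) clip into flattened space-time patches.

    Each patch spans pt frames and a ph x pw spatial window, so one
    token carries information across both space and time.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # group patch indices first
    return x.reshape(-1, pt * ph * pw * C)

rng = np.random.default_rng(0)

# Hypothetical latent clip: 8 frames of 16x16 with 4 channels
# (a stand-in for the output of an autoencoder's encoder).
latent_clip = rng.standard_normal((8, 16, 16, 4))

noisy, noise = forward_diffuse(latent_clip, alpha_bar=0.5, rng=rng)
tokens = patchify(noisy, pt=2, ph=4, pw=4)
print(tokens.shape)  # (64, 128): 4*4*4 patches, each 2*4*4*4 values
```

In a full system, a transformer would attend over all 64 tokens at once, which is how attention across space-time patches can help keep frames consistent, and it would be trained to predict `noise` from `noisy` so that the process can be run in reverse.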

11m watch time
