https://sander.ai/2025/04/15/latents.html

Speaker info:
- https://sander.ai/
- https://github.com/benanne
- https://www.linkedin.com/in/sanderdieleman
- https://x.com/sanderdieleman

Timestamps
0:00 Introduction
2:55 Data Curation
4:02 Representation
9:39 Modeling: Diffusion Mechanism
20:01 Network Architecture
22:25 Training at Scale
23:33 Sampling & Guidance
28:03 Distillation
30:03 Control Signals

AI Engineer

Sander Dieleman, research scientist at Google DeepMind, gives a behind-the-scenes overview of training large-scale generative image and video models. The talk covers eight areas: data curation (often underrated but critical); latent representations, where autoencoders compress pixel data; diffusion modeling as iterative denoising, including a frequency-domain analysis showing that diffusion acts as spectral autoregression; transformer-based architectures replacing U-Nets; training at scale with JAX and model parallelism; sampling with classifier-free guidance, which trades diversity for quality; distillation techniques such as consistency models that reduce the number of sampling steps; and conditioning signals beyond text prompts, such as camera control and reference-based generation.
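
For readers unfamiliar with classifier-free guidance, the sketch below shows the commonly used combination rule in plain Python. It is not taken from the talk: the function name `cfg_combine`, the `guidance_scale` parameter, and the toy numbers are all illustrative assumptions, and a real model would apply the same rule to large tensors of predicted noise at every sampling step.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    A scale of 0 returns the unconditional prediction, 1 returns the
    conditional one, and values above 1 extrapolate past the conditional
    prediction (typically less diversity, stronger prompt adherence).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a three-element "image" (made-up values).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

# Hypothetical scale of 3.0, chosen only for illustration.
guided = cfg_combine(eps_uncond, eps_cond, 3.0)
print(guided)
```

Raising the scale pushes each sample further toward what the conditioning signal implies, which is the diversity-for-quality trade-off the summary refers to.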

Building Generative Image & Video models at Scale - Sander Dieleman, Google DeepMind