Kullback–Leibler (KL) divergence measures how much one probability distribution differs from another. But where does its definition come from?

In this video, we provide an overview of KL divergence and discuss how to develop a practical method for estimating it. 

00:00 Introduction
00:52 Surprise (Self-information)
01:55 Entropy
03:24 Cross-entropy
03:42 KL divergence
04:33 Asymmetry in KL divergence
06:34 Computation challenge of KL divergence
07:13 Monte Carlo estimation
09:11 Biased estimator
10:23 Unbiased and low-variance estimator

Reference:
- The low-variance Monte Carlo estimator discussed in the second half of the video comes from John Schulman's blog post, which covers it in more detail:
http://joschu.net/blog/kl-approx.html

Video made with Manim: https://www.manim.community/

Jia-Bin Huang

A walkthrough of KL divergence from first principles, starting with self-information and entropy, building up to cross-entropy, and explaining the difference between forward and reverse KL. The post then addresses the computational challenge of estimating KL divergence efficiently, demonstrating Monte Carlo estimation, its high-variance problem, and a control variates trick that achieves both low variance and unbiasedness. Practical ML applications covered include classification loss, knowledge distillation, and RLHF alignment.
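
The three estimators from the second half of the video can be compared in a few lines of NumPy. This is a minimal sketch, not the video's own code. It assumes two unit-variance Gaussians q = N(0, 1) and p = N(0.1, 1), chosen only because KL(q || p) then has a closed form to check against. The names `log_normal_pdf` and `kl_estimates` are hypothetical helpers. The labels k1, k2, k3 follow John Schulman's blog post, with r = p(x) / q(x) and samples x drawn from q.

```python
import numpy as np


def log_normal_pdf(x, mean, std):
    """Log-density of the normal distribution N(mean, std^2) evaluated at x."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)


def kl_estimates(n_samples=200_000, mu_q=0.0, mu_p=0.1, seed=0):
    """Return the closed-form KL(q || p) and per-sample values of k1, k2, k3.

    Assumption for illustration: q = N(mu_q, 1) and p = N(mu_p, 1).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_q, 1.0, size=n_samples)  # samples drawn from q

    # log r = log p(x) - log q(x)
    log_r = log_normal_pdf(x, mu_p, 1.0) - log_normal_pdf(x, mu_q, 1.0)

    samples = {
        "k1": -log_r,                   # naive Monte Carlo: unbiased, high variance
        "k2": 0.5 * log_r**2,           # low variance, but biased
        "k3": np.expm1(log_r) - log_r,  # (r - 1) - log r: unbiased, low variance
    }

    # Closed form for two unit-variance Gaussians: (mu_p - mu_q)^2 / 2
    true_kl = 0.5 * (mu_p - mu_q) ** 2
    return true_kl, samples


if __name__ == "__main__":
    true_kl, samples = kl_estimates()
    print(f"closed-form KL(q || p) = {true_kl:.5f}")
    for name, k in samples.items():
        print(f"{name}: mean = {k.mean():.5f}, std = {k.std():.5f}")
```

The sample mean of each array is the corresponding estimate, and the sample standard deviation shows how noisy a single draw is. k3 adds the term r - 1 to k1. That term has zero expectation under q, so the estimator stays unbiased, and because log r <= r - 1, every individual k3 sample is non-negative, unlike k1.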

Fantastic KL Divergence and How to (Actually) Compute It