A walkthrough of KL divergence from first principles, starting with self-information and entropy, building up to cross-entropy, and explaining the difference between forward and reverse KL. The post then addresses the computational challenge of estimating KL divergence efficiently, demonstrating Monte Carlo estimation, its high-variance problem, and a control variates trick that achieves both low variance and unbiasedness. Practical ML applications covered include classification loss, knowledge distillation, and RLHF alignment.
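As a rough illustration of the estimation problem described above, the sketch below compares a naive Monte Carlo estimator of KL(q‖p) with a control-variate version. Everything specific here is an assumption made for the example, not taken from the post: the two unit-variance Gaussians, the helper names `log_normal_pdf` and `kl_estimates`, and the particular control-variate form (r − 1) − log r with r = p(x)/q(x), which is one common choice. Because E_q[r − 1] = 0, adding that term leaves the expectation unchanged while reducing variance when p and q are close.

```python
import math
import random
import statistics


def log_normal_pdf(x, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def kl_estimates(n_samples=100_000, mu_q=0.0, mu_p=0.1, std=1.0, seed=0):
    """Estimate KL(q || p) from samples x ~ q with two per-sample estimators.

    naive:           -log r,            where r = p(x) / q(x)
    control variate: (r - 1) - log r    (E_q[r - 1] = 0, so the mean is unchanged)

    The distributions and defaults are hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    naive, with_cv = [], []
    for _ in range(n_samples):
        x = rng.gauss(mu_q, std)  # sample from q
        log_r = log_normal_pdf(x, mu_p, std) - log_normal_pdf(x, mu_q, std)
        naive.append(-log_r)
        # expm1(log_r) computes r - 1 accurately when r is close to 1
        with_cv.append(math.expm1(log_r) - log_r)

    # Closed form for two Gaussians that share the same standard deviation
    true_kl = (mu_q - mu_p) ** 2 / (2.0 * std ** 2)
    return {
        "true_kl": true_kl,
        "naive_mean": statistics.fmean(naive),
        "naive_std": statistics.pstdev(naive),
        "control_variate_mean": statistics.fmean(with_cv),
        "control_variate_std": statistics.pstdev(with_cv),
    }


if __name__ == "__main__":
    for name, value in kl_estimates().items():
        print(f"{name}: {value:.6f}")
```

Under these assumptions both estimators should average close to the closed-form value, while the per-sample standard deviation of the control-variate estimator should be much smaller than that of the naive one. The naive estimator can also return negative per-sample values even though KL divergence is never negative; the control-variate form cannot, since r − 1 ≥ log r for every r > 0.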
•11m watch time