Knowledge distillation is a model compression technique in which a smaller 'student' model is trained to mimic the outputs of a larger 'teacher' model. The post's worked example uses a CNN teacher trained on MNIST and a simpler feed-forward student, trained with KL divergence as the loss function so that the student's predicted probability distributions match the teacher's. The resulting student is 35% faster at inference with only a 1-2% drop in accuracy. A key tradeoff is that the larger teacher model must still be trained first, which may be impractical in resource-constrained settings. DistilBERT is cited as a real-world example: it is 40% smaller than BERT while retaining 97% of its capabilities.
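
To make the loss concrete, here is a minimal sketch of a KL-divergence distillation loss in plain Python. It is not the post's code: the function names and the three-class logits are hypothetical, chosen only to show that the loss is zero when the student reproduces the teacher's distribution and grows as the two distributions diverge.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for two discrete distributions.

    eps is a small constant (an assumption of this sketch) that avoids log(0).
    """
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )

def distillation_loss(teacher_logits, student_logits):
    """Loss the student minimizes: divergence from the teacher's soft labels."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

# Hypothetical logits for a single 3-class example (illustrative values only).
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # nearly agrees with the teacher
far_student = [0.5, 4.0, 1.0]     # confidently predicts a different class

loss_same = distillation_loss(teacher_logits, teacher_logits)
loss_close = distillation_loss(teacher_logits, close_student)
loss_far = distillation_loss(teacher_logits, far_student)

print(f"identical: {loss_same:.6f}")
print(f"close:     {loss_close:.6f}")
print(f"far:       {loss_far:.6f}")
```

In a real training loop the same quantity would be averaged over a batch and backpropagated through the student only, with the teacher's weights kept frozen.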

4m read time | From blog.dailydoseofds.com
Table of contents
Implement knowledge distillation from scratch.
