A text and code version of Karpathy’s famous tokenizer video.

fast.ai provides educational resources and tutorials for learning about deep learning and machine learning. Readers can access courses, code examples, and practical projects to gain hands-on experience with AI technologies. Additionally, they can learn about state-of-the-art deep learning techniques, model architectures, and real-world applications of machine learning.

A comprehensive technical walkthrough of tokenization in large language models, covering the byte-pair encoding (BPE) algorithm from first principles. Explains why tokenization causes common LLM issues like poor spelling and arithmetic, demonstrates building a GPT-style tokenizer from scratch with complete Python implementations, and compares different approaches including GPT-2, GPT-4, and SentencePiece. Includes hands-on exercises for implementing your own tokenizer and understanding the trade-offs between byte-level and code-point-level tokenization strategies.
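
As a rough illustration of the byte-level BPE idea summarized above, here is a minimal Python sketch. It is not the guide's own implementation: the function names (`get_pair_counts`, `merge`, `train_bpe`, `decode`) are hypothetical, and it assumes training starts from raw UTF-8 bytes (ids 0 to 255) with each new merge assigned the next free id. It also skips the regex pre-splitting that GPT-style tokenizers apply before merging.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # byte-level start: ids in 0..255
    merges = {}                       # (id, id) -> new id, in training order
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break                     # fewer than two tokens left
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step                 # assumption: next free id
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Map token ids back to text by expanding merges into bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dicts preserve training order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


text = "aaabdaaabac"
ids, merges = train_bpe(text, num_merges=3)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges) == text)
```

Starting from bytes rather than Unicode code points keeps the base vocabulary fixed at 256 entries, at the cost of longer initial sequences for non-ASCII text; for example, `len("é")` is 1 while `len("é".encode("utf-8"))` is 2.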

Let’s Build the GPT Tokenizer: A Complete Guide to Tokenization in LLMs – fast.ai