Large language models (LLMs) such as ChatGPT are built through a pre-training process that begins with downloading and processing large quantities of diverse, high-quality internet text. Common Crawl is a key source of this data, and filtering steps such as URL filtering, text extraction, and language filtering are critical to cleaning it. Tokenization then converts the text into a sequence of symbols (tokens) that a neural network can process. The network is trained to model the statistical relationships between tokens so that it can predict the next token in a sequence. Inference is the process of generating new text from the trained model by repeatedly predicting the next token from a given input.
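
The tokenize, train, and infer loop described above can be illustrated with a deliberately tiny sketch. It is not how ChatGPT or any production LLM is implemented: a whitespace tokenizer stands in for subword tokenization, a table of bigram counts stands in for the neural network, and greedy selection stands in for sampling. The function names (`tokenize`, `train_bigram`, `predict_next`, `generate`) and the corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Toy whitespace tokenizer (assumption); real LLMs use subword tokenizers.
    return text.lower().split()


def train_bigram(tokens):
    # Count how often each token follows each other token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    # Return the most frequent follower of `token`, or None if it was never seen.
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, prompt, max_new_tokens=5):
    # Inference: repeatedly append the predicted next token to the sequence.
    tokens = tokenize(prompt)
    if not tokens:
        return ""
    for _ in range(max_new_tokens):
        nxt = predict_next(counts, tokens[-1])
        if nxt is None:
            break
        tokens.append(nxt)
    return " ".join(tokens)


corpus = "the cat sat on the mat . the cat sat on the sofa ."
model = train_bigram(tokenize(corpus))
print(generate(model, "the", max_new_tokens=3))  # prints "the cat sat on"
```

A bigram table only looks at the single previous token, whereas a trained neural network conditions on a much longer context; the sketch is meant only to show how prediction of the next token turns into generation when applied in a loop.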

3h 31m watch time
