A comprehensive hands-on guide to building a language-specific LLM from scratch, using Urdu as the target language. Covers the full pipeline: collecting and cleaning data from the CulturaX dataset hosted on Hugging Face, training a BPE tokenizer with a 32K vocabulary using the tokenizers library, implementing a decoder-only GPT-style transformer architecture in PyTorch with multi-head self-attention, and running pre-training on Google Colab's free T4 GPU. Includes detailed explanations of model configuration parameters, training hyperparameters, learning rate scheduling with warmup and cosine decay, and text generation strategies such as top-k and nucleus (top-p) sampling. The guide also covers supervised fine-tuning and deployment with Gradio.
Table of contents
Components of LLM Training
1. Data Preparation
2. Tokenization
3. Pre-Training
4. Supervised Fine-Tuning (SFT)
5. Deployment
Full Pipeline Summary
Results
Conclusion