Best of PyTorch: June 2025

  1.
    Article
    Sebastian Raschka · 47w

    Coding LLMs from the Ground Up: A Complete Course

    Sebastian Raschka shares a comprehensive video course series on building Large Language Models from scratch using Python and PyTorch. The course covers seven key areas: setting up the environment, preprocessing and tokenizing text data, implementing attention mechanisms, coding the LLM architecture, pretraining on unlabeled data, fine-tuning for classification, and instruction fine-tuning. The content serves as supplementary material to his book 'Build a Large Language Model (From Scratch)' and emphasizes hands-on learning through implementation rather than relying on pre-built frameworks.
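
    For flavor, here is a minimal sketch (not the book's actual code) of the kind of from-scratch building block the course assembles: a single-head causal self-attention module in plain PyTorch. All names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Single-head causal self-attention built from plain PyTorch ops."""

    def __init__(self, d_in, d_out, context_length):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)
        # Upper-triangular mask: True above the diagonal marks "future" positions.
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        self.register_buffer("mask", mask.bool())

    def forward(self, x):  # x: (batch, seq_len, d_in)
        seq_len = x.shape[1]
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5
        # Block attention to future tokens, then normalize into weights.
        scores = scores.masked_fill(self.mask[:seq_len, :seq_len], float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

attn = CausalSelfAttention(d_in=64, d_out=64, context_length=128)
out = attn(torch.randn(2, 10, 64))  # -> shape (2, 10, 64)
```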

  2.
    Video
    YouTube · 45w

    STOP Taking Random AI Courses - Read These Books Instead

    A comprehensive guide to learning AI and machine learning through structured resources rather than random courses. Covers five key areas: programming fundamentals with Python, mathematics and statistics foundations, traditional machine learning concepts, deep learning and LLMs, and AI engineering for production deployment. Emphasizes practical application over theoretical study, recommending specific books like 'Hands-On ML with Scikit-Learn and TensorFlow' and courses like Andrew Ng's specializations. Highlights the importance of understanding both foundational concepts and modern deployment practices for current AI engineering roles.

  3.
    Article
    Sebastian Raschka · 46w

    Understanding and Coding the KV Cache in LLMs from Scratch

    The KV cache is a critical optimization for LLM inference: it stores previously computed key and value vectors so they are not recomputed during text generation. Caching these intermediate attention results and reusing them for subsequent tokens yields significant speedups (up to 5x in the article's examples). Implementation involves modifying the attention mechanism to store and retrieve cached values, at the cost of higher memory usage and added code complexity. The article provides a complete from-scratch implementation with performance comparisons and optimization strategies for production use.
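
    The core idea can be sketched in a few lines of PyTorch. This is an illustrative reduction, not the article's implementation: during decoding, each step contributes one new key/value pair, and attention runs over the accumulated cache instead of reprocessing the whole sequence. No causal mask is needed here because the cache only ever contains past and current positions.

```python
import torch

def attend_with_kv_cache(q, k, v, cache=None):
    """One decode step: append this token's K/V to the cache, attend over
    everything cached so far, and return the updated cache for reuse."""
    if cache is not None:
        k = torch.cat([cache[0], k], dim=1)  # (batch, steps_so_far, d)
        v = torch.cat([cache[1], v], dim=1)
    scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5
    out = torch.softmax(scores, dim=-1) @ v
    return out, (k, v)

# Each step only computes projections for one new token; past K/V come
# from the cache rather than being recomputed.
cache = None
for _ in range(4):
    x = torch.randn(1, 1, 64)  # stand-in for one new token's q/k/v projections
    out, cache = attend_with_kv_cache(x, x, x, cache)
print(cache[0].shape)  # torch.Size([1, 4, 64])
```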

  4.
    Article
    Hugging Face · 45w

    (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware

    QLoRA enables fine-tuning of the FLUX.1-dev diffusion model on consumer hardware with under 10 GB of VRAM by combining 4-bit quantization with Low-Rank Adaptation. The approach uses bitsandbytes for quantization, an 8-bit AdamW optimizer, gradient checkpointing, and cached latents to cut memory usage from roughly 120 GB to roughly 9 GB. Training on an RTX 4090 takes 41 minutes for 700 steps, while FP8 training with torchao on an H100 reduces that to 20 minutes. The technique maintains high-quality results while making advanced model customization accessible to developers without enterprise-grade hardware.
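
    A condensed sketch of the recipe, assuming recent diffusers, peft, and bitsandbytes APIs; the post's full script also caches latents and runs the actual training loop, and the hyperparameters below are illustrative rather than the post's exact values.

```python
import torch
import bitsandbytes as bnb
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
from peft import LoraConfig

# Load the FLUX.1-dev transformer in 4-bit NF4 -- the "Q" in QLoRA.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
transformer.enable_gradient_checkpointing()

# Attach small trainable LoRA adapters to the attention projections;
# the 4-bit base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
transformer.add_adapter(lora_config)

# 8-bit AdamW keeps optimizer state small; only LoRA params are trained.
trainable = [p for p in transformer.parameters() if p.requires_grad]
optimizer = bnb.optim.AdamW8bit(trainable, lr=1e-4)
```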