6 Things I Learned Building LLMs From Scratch That No Tutorial Teaches You
This title could be clearer and more informative.Try out Clickbait Shieldfor free (5 uses left this month).
A practitioner shares six non-obvious architectural insights gained from implementing GPT-2 from scratch with PyTorch, covering: RsLoRA's rank-stabilized scaling fix (with statistical proof that LoRA's variance shrinks as rank grows), why RoPE outperforms sinusoidal and learned positional embeddings, when weight tying makes sense vs. when it disappears at scale, the stability tradeoff between Pre-LayerNorm and Post-LayerNorm, how KV Cache reduces attention compute from O(T²) to O(T) and its memory cost, and why LayerNorm is skipped during INT8 quantization due to its sensitivity relative to negligible parameter savings.
Table of contents
1. LoRA vs RsLoRA (Rank Stabilized):2. RoPE instead of Learned Parameters or Sinusoidal Positional Embeddings (PEs)3. Weight Tying4. Pre-LayerNorm vs Post-LayerNorm5. KV-Cache6. Quantization Tradeoff: Why LayerNorm is skipped during INT8 quantizationConclusionReferencesSort: