A practitioner shares six non-obvious architectural insights gained from implementing GPT-2 from scratch with PyTorch, covering: RsLoRA's rank-stabilized scaling fix (with statistical proof that LoRA's variance shrinks as rank grows), why RoPE outperforms sinusoidal and learned positional embeddings, when weight tying makes sense vs. when it disappears at scale, the stability tradeoff between Pre-LayerNorm and Post-LayerNorm, how KV Cache reduces attention compute from O(T²) to O(T) and its memory cost, and why LayerNorm is skipped during INT8 quantization due to its sensitivity relative to negligible parameter savings.

11m read timeFrom towardsdatascience.com
Post cover image
Table of contents
1. LoRA vs RsLoRA (Rank Stabilized):2. RoPE instead of Learned Parameters or Sinusoidal Positional Embeddings (PEs)3. Weight Tying4. Pre-LayerNorm vs Post-LayerNorm5. KV-Cache6. Quantization Tradeoff: Why LayerNorm is skipped during INT8 quantizationConclusionReferences

Sort: