A deep technical analysis of recent architectural innovations in open-weight LLMs, focused on long-context efficiency. It covers four key developments: (1) Gemma 4's cross-layer KV sharing and per-layer embeddings (PLE) that reduce KV cache size by ~50% at 128K context; (2) Laguna XS.2's per-layer query-head budgeting, which allocates more query heads to sliding-window layers and fewer to global-attention layers; (3) ZAYA1-8B's Compressed Convolutional Attention (CCA), which performs attention directly in a compressed latent space with convolutional mixing on Q/K; and (4) DeepSeek V4's manifold-constrained hyper-connections (mHC) for wider residual streams, plus CSA/HCA compressed attention that needs 27% of the FLOPs and 10% of the KV cache size of DeepSeek V3.2 at 1M-token context. The overarching trend is that transformer architectures are increasingly being specialized for long-context inference efficiency rather than replaced wholesale.
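
As a rough illustration of why sharing KV tensors across layers shrinks the cache, the sketch below estimates KV cache size for a single sequence. The function name `kv_cache_bytes`, the `layers_with_own_kv` parameter, and every number in `cfg` are hypothetical and are not taken from Gemma 4 or any other released model. The sketch only assumes that a layer reusing another layer's K and V tensors stores nothing of its own.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, layers_with_own_kv=None):
    """Approximate KV cache size in bytes for one sequence.

    layers_with_own_kv: number of layers that store their own K and V
    tensors; the remaining layers are assumed to reuse another layer's
    tensors and therefore add nothing to the cache.
    """
    if layers_with_own_kv is None:
        layers_with_own_kv = num_layers
    # The factor 2 accounts for storing both K and V.
    per_layer = 2 * num_kv_heads * head_dim * seq_len * bytes_per_value
    return layers_with_own_kv * per_layer


# Hypothetical configuration (illustrative only), 128K-token context, 16-bit values.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=128 * 1024)

baseline = kv_cache_bytes(**cfg)
# Assume every second layer reuses its neighbor's K and V tensors.
shared = kv_cache_bytes(**cfg, layers_with_own_kv=cfg["num_layers"] // 2)

print(f"baseline: {baseline / 2**30:.1f} GiB")  # baseline: 16.0 GiB
print(f"shared:   {shared / 2**30:.1f} GiB ({shared / baseline:.0%} of baseline)")  # 8.0 GiB (50%)
```

Under these assumptions the cache scales linearly with the number of layers that keep their own K and V, so sharing across half of the layers gives a reduction of about 50%. The actual saving in a real model depends on which layers share and on how sliding-window layers are cached.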

25 min read time
From magazine.sebastianraschka.com
Table of contents
Previous Topics
1. Reusing KV Tensors Across Layers to Shrink the Cache (Gemma 4)
2. Per-Layer Embeddings and “Effective” Size (Gemma 4 E2B/E4B)
3. Layer-Wise Attention Budgeting (Laguna XS.2)
4. Compressed Convolutional Attention (ZAYA1-8B)
5. CSA/HCA, mHC, and Compressed Attention Caches (DeepSeek V4)
6. Conclusion