A deep technical analysis of recent architectural innovations in open-weight LLMs, focused on long-context efficiency. It covers four key developments: (1) Gemma 4's cross-layer KV sharing and per-layer embeddings (PLE) that reduce KV cache size by ~50% at 128K context; (2) Laguna XS.2's per-layer query-head budgeting, which allocates more heads to sliding-window layers and fewer to global-attention layers; (3) ZAYA1-8B's Compressed Convolutional Attention (CCA), which performs attention directly in a compressed latent space with convolutional mixing on Q/K; and (4) DeepSeek V4's manifold-constrained hyper-connections (mHC) for wider residual streams, plus CSA/HCA compressed attention, which uses 27% of the FLOPs and 10% of the KV cache size of DeepSeek V3.2 at 1M-token context. The overarching trend is that transformer architectures are being increasingly specialized for long-context inference efficiency rather than replaced wholesale.
Table of contents
Previous Topics
1. Reusing KV Tensors Across Layers to Shrink the Cache (Gemma 4)
2. Per-Layer Embeddings and “Effective” Size (Gemma 4 E2B/E4B)
3. Layer-Wise Attention Budgeting (Laguna XS.2)
4. Compressed Convolutional Attention (ZAYA1-8B)
5. CSA/HCA, mHC, and Compressed Attention Caches (DeepSeek V4)
6. Conclusion