Differential Transformer V2 introduces a redesigned attention mechanism that doubles query heads while maintaining key-value heads, eliminating the need for custom kernels and achieving faster decoding speeds. The architecture removes per-head RMSNorm to improve training stability, introduces token-level and head-level lambda
•9m read time• From huggingface.co
Sort: