Differential Transformer V2 introduces a redesigned attention mechanism that doubles the number of query heads while keeping the number of key-value heads unchanged, eliminating the need for custom kernels and achieving faster decoding. The architecture removes per-head RMSNorm to improve training stability and introduces token-level and head-level lambda parameters.
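
The excerpt above doesn't include any code, so here is a minimal PyTorch sketch of the differential-attention idea it describes, purely for illustration. The function name `diff_attention_v2`, the tensor shapes, and the `(batch, heads, seq, 1)` shape of the per-token, per-head lambda are assumptions, not the authors' implementation; in particular, having both query groups attend against a single shared key/value group is inferred from "doubles query heads while maintaining key-value heads". Causal masking is omitted for brevity.

```python
import torch

def diff_attention_v2(q1, q2, k, v, lam):
    """Sketch of one differential-attention step (not the official code).

    q1, q2: (batch, heads, seq, d)  -- two query groups per head
    k, v:   (batch, heads, seq, d)  -- one shared key/value group (assumption)
    lam:    (batch, heads, seq, 1)  -- hypothetical token- and head-level lambda
    """
    d = q1.size(-1)
    # Two ordinary softmax attention maps over the same keys.
    a1 = torch.softmax(q1 @ k.transpose(-2, -1) / d**0.5, dim=-1)
    a2 = torch.softmax(q2 @ k.transpose(-2, -1) / d**0.5, dim=-1)
    # Differential map: subtract the second map, scaled by lambda.
    # No per-head RMSNorm is applied to the output, per the V2 description.
    attn = a1 - lam * a2
    return attn @ v

# Toy usage with random tensors.
b, h, s, d = 2, 4, 16, 32
q1, q2, k, v = (torch.randn(b, h, s, d) for _ in range(4))
lam = torch.sigmoid(torch.randn(b, h, s, 1))  # hypothetical lambda parameterization
out = diff_attention_v2(q1, q2, k, v, lam)    # -> (b, h, s, d)
```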

Table of contents
- Abstract
- Code
- Motivation
- Experimental Observations
- Discussions
