Multi-Query Attention (MQA) is an attention mechanism that accelerates token generation in the decoder while preserving model quality. It is a variant of multi-head attention in which all query heads share a single set of key and value heads. MQA achieves this inference speed-up by reducing memory usage.

From pub.towardsai.net
Table of contents
- Multi-Query Attention Explained
- Multi-Head Attention (MHA)
- Multi-Query Attention (MQA)
- Performance
- Why can MQA achieve inference acceleration?
- Conclusion
- References
