Multi-Query Attention (MQA) is an attention mechanism that accelerates token generation in the decoder while preserving model quality. It is a variant of multi-head attention in which all query heads share a single set of key and value heads. MQA achieves this inference speedup by shrinking the key-value cache, reducing memory usage and memory bandwidth during decoding.
Table of contents
- Multi-Query Attention Explained
  - Multi-Head Attention (MHA)
  - Multi-Query Attention (MQA)
  - Performance
  - Why can MQA achieve inference acceleration?
- Conclusion
- References
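The sharing scheme described above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not from the original paper: the projection matrices `W_q`, `W_k`, `W_v` and the function name are hypothetical, and the key/value projections map to a single head of width `d_head` rather than one per query head.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, W_q, W_k, W_v, num_heads):
    """Sketch of MQA: num_heads query heads share ONE key/value head."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q = x @ W_q          # (seq, d_model): one slice per query head
    K = x @ W_k          # (seq, d_head): single shared key head
    V = x @ W_v          # (seq, d_head): single shared value head
    heads = []
    for h in range(num_heads):
        q = Q[:, h * d_head:(h + 1) * d_head]    # (seq, d_head)
        scores = q @ K.T / np.sqrt(d_head)       # every head attends over the same K
        heads.append(softmax(scores) @ V)        # ...and reads from the same V
    return np.concatenate(heads, axis=-1)        # (seq, d_model)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
out = multi_query_attention(
    x,
    W_q=rng.standard_normal((8, 8)),
    W_k=rng.standard_normal((8, 2)),  # K/V project to d_head=2, not d_model=8
    W_v=rng.standard_normal((8, 2)),
    num_heads=4,
)
```

Note that only `W_q` is full-width: `W_k` and `W_v` produce a single `d_head`-sized head, so the decoder's KV cache is `num_heads` times smaller than in standard multi-head attention.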