Multi-Query Attention (MQA) is an attention mechanism that accelerates token generation in the decoder while largely preserving model quality. It is a variant of multi-head attention in which all query heads share a single set of key and value heads. MQA achieves inference acceleration by shrinking the key-value cache, which reduces memory usage and memory bandwidth pressure and improves computational utilization.
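The core idea can be illustrated with a minimal NumPy sketch. The head counts and dimensions below are illustrative assumptions, not taken from any particular model; the point is that MQA keeps one key head and one value head that broadcast across all query heads, so the cached K/V tensors are `n_heads` times smaller than in MHA.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q: (q_heads, seq, d); k, v: (kv_heads, seq, d).
    # kv_heads == q_heads for MHA, kv_heads == 1 for MQA (broadcast).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_heads, seq, d = 8, 16, 64          # assumed toy sizes
q = rng.standard_normal((n_heads, seq, d))

# MHA: every query head has its own key and value head.
k_mha = rng.standard_normal((n_heads, seq, d))
v_mha = rng.standard_normal((n_heads, seq, d))
out_mha = attention(q, k_mha, v_mha)

# MQA: one shared key head and one shared value head,
# so the KV cache is n_heads times smaller.
k_mqa = rng.standard_normal((1, seq, d))
v_mqa = rng.standard_normal((1, seq, d))
out_mqa = attention(q, k_mqa, v_mqa)

print(out_mha.shape, out_mqa.shape)  # both (8, 16, 64)
```

Both variants produce output of the same shape; only the size of the cached keys and values differs, which is where the decoding speedup comes from.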
Table of contents

- Multi-Query Attention Explained
- Multi-Head Attention (MHA)
- Multi-Query Attention (MQA)
- Performance
- Why can MQA achieve inference acceleration?
- Conclusion
- References