This post explores Grouped Query Attention (GQA), an attention variant that interpolates between Multi-Head Attention (MHA) and Multi-Query Attention (MQA). GQA offers a trade-off between quality and speed: by letting groups of query heads share key/value heads, it reduces the memory-bandwidth demands of the KV cache. It has been used in recent large-scale language models such as LLaMA-2 and Mistral 7B for efficient pre-training.
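To make the grouping concrete, here is a minimal NumPy sketch (not the article's code) of single-sequence grouped query attention. The shapes and the `grouped_query_attention` helper are assumptions for illustration; the key idea is that each key/value head is broadcast across a group of query heads, so MHA and MQA fall out as the two extremes.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch: q has more heads than k/v; each group of
    query heads shares one KV head. n_q == n_kv recovers MHA,
    n_kv == 1 recovers MQA. (Illustrative, not the article's code.)"""
    n_q_heads, seq_len, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads           # query heads per KV head
    # Broadcast each KV head across its group of query heads
    k = np.repeat(k, group, axis=0)           # (n_q_heads, seq_len, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over keys
    return weights @ v                         # (n_q_heads, seq_len, d)

# 8 query heads grouped over 2 KV heads (4 query heads per group)
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))
k = rng.normal(size=(2, 5, 16))
v = rng.normal(size=(2, 5, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

The memory saving comes from storing and streaming only `n_kv_heads` key/value tensors instead of one per query head, which is what cuts KV-cache bandwidth at inference time.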

From towardsdatascience.com
Table of contents
- Demystifying GQA — Grouped Query Attention for Efficient LLM Pre-training
- Emergence of Multi-Query Attention (MQA)
