This post explores Grouped Query Attention (GQA), an attention variant that interpolates between Multi-Head Attention (MHA) and Multi-Query Attention (MQA). GQA offers a trade-off between quality and speed: by letting groups of query heads share key/value heads, it reduces the memory-bandwidth demands of the KV cache. It has been used in recent large-scale language models such as LLaMA-2 and Mistral 7B for efficient pre-training.
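To make the grouping concrete, here is a minimal NumPy sketch (not the article's code) of single-sequence grouped query attention. The shapes and the `grouped_query_attention` helper are assumptions for illustration; the key idea is that each key/value head is broadcast across a group of query heads, so MHA and MQA fall out as the two extremes.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch: q has more heads than k/v; each group of
    query heads shares one KV head. n_q == n_kv recovers MHA,
    n_kv == 1 recovers MQA. (Illustrative, not the article's code.)"""
    n_q_heads, seq_len, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads           # query heads per KV head
    # Broadcast each KV head across its group of query heads
    k = np.repeat(k, group, axis=0)           # (n_q_heads, seq_len, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over keys
    return weights @ v                         # (n_q_heads, seq_len, d)

# 8 query heads grouped over 2 KV heads (4 query heads per group)
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))
k = rng.normal(size=(2, 5, 16))
v = rng.normal(size=(2, 5, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

The memory saving comes from storing and streaming only `n_kv_heads` key/value tensors instead of one per query head, which is what cuts KV-cache bandwidth at inference time.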

From towardsdatascience.com
Table of contents
- Demystifying GQA — Grouped Query Attention for Efficient LLM Pre-training
- Emergence of Multi-Query Attention (MQA)
