Member-only story

Navigating the Attention Landscape: MHA, MQA, and GQA Decoded

5 min readJan 4, 2024

Attention mechanisms are the driving force behind many of today’s cutting-edge Large Language Models. They allow these models to focus on relevant parts of an input sequence, like a sentence or document, and extract meaning with astonishing accuracy. But with different flavours of attention popping up, things can get confusing. Today, we’ll explore three key players:

Multi-Head Attention (MHA)
Multi-Query Attention (MQA)
Grouped-Query Attention (GQA)

Whether you’re a complete beginner or someone with a basic understanding of NLP, I tried to make this article designed in a way to provide clear explanations related to MHA, MQA and GQA.

The Attention Spotlight:

Attention mechanisms focus on relevant parts of an input sequence like sentences or documents.
Imagine a magician pulling a rabbit out of a hat — attention mechanisms selectively “reveal” important information.
Attention plays a crucial role in tasks like understanding sentiment, translating languages, and answering questions.

Multi-Head Attention: The Star Player

Point 1: MHA uses multiple “heads” to attend to different aspects of the input simultaneously, like a multi-tasking detective.
Point 2: Think of it as reading a newspaper with multiple headlines — MHA captures various nuances of the text.
Point 3: MHA provides high-quality results but can be computationally expensive due to parallel processing.

2. Multi-Query Attention: The Speedster

Point 1: MQA uses only one head for faster processing, like a sprinter racing through the input.
Point 2: Imagine skimming a document for keywords — MQA quickly grasps the main gist of the content.
Point 3: While efficient, MQA might miss important details compared to MHA’s in-depth analysis.

3. Grouped-Query Attention: The Rising Star

Point 1: GQA balances speed and detail by grouping queries and focusing attention on those groups.
Point 2: Think of it as dividing a team into smaller units to tackle specific tasks, yet collaborating overall.
Point 3: GQA offers faster processing than MHA while…

Navigating the Attention Landscape: MHA, MQA, and GQA Decoded

The Attention Spotlight:

Written by Shobhit Agarwal

Responses (2)