Artificial Intelligence

Multi-Head Attention

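An extension of attention in which several attention mechanisms (heads) run in parallel, each with its own learned query, key, and value projections, so that each head can focus on a different type of relationship in the data. The head outputs are then concatenated and passed through a final linear projection.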
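The definition above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: it assumes self-attention over a single unbatched sequence, scaled dot-product attention inside each head, and random matrices standing in for learned weights. The names multi_head_attention, attention_head, w_q, w_k, w_v, and w_o are hypothetical and chosen only for this example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # (seq_len, seq_len)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Run num_heads attention heads in parallel and combine their outputs."""
    d_model = len(x[0])
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

    head_outputs, head_weights = [], []
    for h in range(num_heads):
        # Each head sees its own slice of the projected queries, keys, and values.
        cols = slice(h * d_head, (h + 1) * d_head)
        out, weights = attention_head(
            [r[cols] for r in q], [r[cols] for r in k], [r[cols] for r in v]
        )
        head_outputs.append(out)
        head_weights.append(weights)

    # Combine: concatenate the heads position by position, then mix with w_o.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_o), head_weights


def random_matrix(rows, cols):
    """Random stand-in for a learned weight matrix or an input sequence."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
seq_len, d_model, num_heads = 4, 8, 2
x = random_matrix(seq_len, d_model)
w_q, w_k, w_v, w_o = (random_matrix(d_model, d_model) for _ in range(4))

output, head_weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(len(output), len(output[0]))  # 4 8
print(len(head_weights))            # 2
```

Each entry of head_weights is a separate seq_len by seq_len attention pattern, which is what allows different heads to attend to different relationships. Practical implementations typically compute all heads at once with batched tensor operations instead of a Python loop.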

Why It Matters

Multi-head attention lets the model attend to several aspects of the input at once, such as syntax, semantics, and long-range dependencies, rather than being limited to a single type of pattern.

Example

One head might learn to track subject-verb agreement, another might track pronoun references, and a third might capture topical relationships, all operating in parallel.

Think of it like...

Like having a panel of experts review a document simultaneously: one focuses on grammar, another on logic, and another on factual accuracy, and their insights are then combined.

Related Terms

Self-Attention

Attention Mechanism

Transformer

Positional Encoding