Artificial Intelligence

Attention Head

A single attention computation within multi-head attention. Each head applies its own learned query, key, and value projections and computes attention scores independently, allowing different heads to specialize in different types of relationships.
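A minimal sketch of one head, using NumPy and illustrative weight names (not tied to any specific library): project the input with the head's own query/key/value matrices, compute scaled dot-product scores, and take a softmax-weighted sum of values.

```python
import numpy as np

def attention_head(x, W_q, W_k, W_v):
    """One attention head: project inputs, score pairs of tokens,
    then return an attention-weighted sum of the values.
    Weight matrices here are hypothetical, for illustration only."""
    Q = x @ W_q                              # queries, shape (seq, d_head)
    K = x @ W_k                              # keys
    V = x @ W_v                              # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scaled dot-product scores
    # softmax over the key axis, shifted for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy sizes: 4 tokens, model dim 8, head dim 2
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W_q, W_k, W_v = (rng.standard_normal((8, 2)) for _ in range(3))
out = attention_head(x, W_q, W_k, W_v)
print(out.shape)  # (4, 2): one head-dim vector per token
```

In a full multi-head layer, several such heads run in parallel on the same input and their outputs are concatenated and linearly projected back to the model dimension.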

Why It Matters

Understanding attention heads helps interpret how transformers process information and enables techniques like head pruning for model optimization.
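As a hypothetical illustration of head pruning, a head's contribution can be removed by zeroing its output before the per-head outputs are concatenated. The shapes and head count below are made up for the sketch:

```python
import numpy as np

# Outputs of 12 hypothetical heads, each (seq_len=4, head_dim=2)
rng = np.random.default_rng(1)
head_outputs = rng.standard_normal((12, 4, 2))

keep = np.ones(12, dtype=bool)
keep[3] = False                            # "prune" head 3 by masking it
pruned = head_outputs * keep[:, None, None]

# Concatenate head outputs along the feature axis as usual
combined = np.concatenate(pruned, axis=-1)  # shape (4, 24)
print(combined.shape)
```

Research on pruning typically masks heads this way first to measure how much each head matters, then permanently removes the low-impact ones to shrink the model.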

Example

In a 12-head attention layer, one head might specialize in tracking subject-verb relationships, another in coreference resolution, and another in positional patterns.

Think of it like...

Like members of a jury each paying attention to different aspects of a case — one focuses on evidence, another on testimony, another on motive — before combining their insights into a verdict.

Related Terms