Artificial Intelligence

Attention Sink

A phenomenon in transformer attention where the first few tokens of a sequence receive disproportionately high attention scores regardless of their content, acting as "sinks" that absorb excess attention mass.

Why It Matters

Understanding attention sinks helps improve model efficiency and enables techniques like StreamingLLM that maintain performance over very long sequences.
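A minimal sketch of the idea behind StreamingLLM-style cache management (not the official implementation; function name and parameters are illustrative): keep the first few "sink" tokens plus a sliding window of recent tokens in the KV cache, and evict everything in between.

```python
def evict_kv_cache(positions, num_sinks=4, window=8):
    """Sketch of a sink-aware eviction policy: retain the first
    `num_sinks` cache positions (the attention sinks) plus the
    most recent `window` positions, dropping the middle."""
    if len(positions) <= num_sinks + window:
        return list(positions)  # cache still fits; nothing to evict
    return list(positions[:num_sinks]) + list(positions[-window:])

# After 20 generated tokens, only the sinks and the recent window remain.
kept = evict_kv_cache(list(range(20)), num_sinks=4, window=8)
print(kept)  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

Keeping the sink tokens is what distinguishes this from a plain sliding window: without them, the attention mass that normally lands on the first tokens has nowhere to go and generation quality degrades.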

Example

The beginning-of-sequence token receiving high attention scores in every layer even though it carries no semantic information — it acts as a default attention target.
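A toy numeric demo of why a default target emerges (logit values are made up for illustration): softmax attention weights must sum to 1, so even when no key is semantically relevant, the probability mass has to land somewhere. In trained models it tends to pile onto the first token.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: weights are positive and sum to 1.
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical attention logits for one query: the sink token (index 0)
# scores moderately high, the remaining keys are all near-irrelevant.
logits = np.array([3.0, 0.1, 0.0, 0.2, 0.1])
weights = softmax(logits)
print(weights.round(3))  # index 0 absorbs most of the attention mass
```

Because the weights are forced to sum to 1, even a modest logit advantage at position 0 lets it soak up the attention that has no better destination.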

Think of it like...

Like a default option in a survey that people select when they are not sure — it absorbs attention that does not have a better target.

Related Terms