Era 13 / 15 · The Transformer 2017

The Transformer

Attention let every word look at every other — in parallel.

Beat 1 · Concrete

Which noun is “it”?

To resolve the pronoun, “it” weighs both candidates — the stronger link wins.

The trophy didn’t fit in the suitcase because it was too big. — “it” attends most to trophy.

Resolving a pronoun by attention. The word “it” sends links to the two candidate nouns; the link to “trophy” grows strongest, the link to “suitcase” stays weak.

Legend: winning link · competing link · the sentence
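
The "stronger link wins" idea can be sketched as a softmax over two scores. This is a minimal illustration, not the model's actual computation: the raw scores below are invented for the example (in a trained Transformer they would come from learned query and key vectors), and the `softmax` helper is written out by hand.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for how well "it" matches each candidate noun.
candidates = ["trophy", "suitcase"]
raw_scores = [2.0, 0.5]

weights = softmax(raw_scores)
winner = candidates[weights.index(max(weights))]

for noun, weight in zip(candidates, weights):
    print(f"it -> {noun}: {weight:.2f}")
print("resolved to:", winner)
```

Both links always get some weight; the higher score simply takes the larger share, which is why the figure shows one bright link and one faint one rather than a single link.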

Beat 2 · Abstract

Every token attends to every token

Self-attention: each token links to all the others, weighted — brighter means heavier.

Each word sends a weighted link to every other word at once. The heaviest pair is trophy and big.

Self-attention weights as links. Five tokens are joined by a faint all-to-all mesh; in turn each token becomes the query and its outgoing links brighten in proportion to their attention weight.

Legend: all pairs (faint) · heaviest link
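
The all-to-all mesh can be sketched as a small table of weights, one row per query token. This is a rough sketch under stated assumptions: the five tokens and their 2-d vectors are hypothetical, chosen by hand so that "it" leans toward "trophy", and each row also includes the token's link to itself, as standard self-attention does.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical hand-picked vectors; a real model learns these.
tokens = ["trophy", "suitcase", "it", "too", "big"]
vectors = {
    "trophy":   [2.0, 0.0],
    "suitcase": [0.0, 1.5],
    "it":       [1.0, 0.2],
    "too":      [0.1, 0.1],
    "big":      [1.8, 0.1],
}

# Each token takes a turn as the query; its row holds one weight per token.
attention = {
    q: softmax([dot(vectors[q], vectors[k]) for k in tokens])
    for q in tokens
}

print(" " * 9 + " ".join(t[:6].ljust(6) for t in tokens))
for q in tokens:
    print(q.ljust(9) + " ".join(f"{w:.2f}".ljust(6) for w in attention[q]))
```

Every row sums to 1, so a brighter link to one token necessarily means fainter links to the rest.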

Beat 3 · Interactive

Pick a word, watch it look

Hover, tap, or focus any token — its attention lights up to the words it depends on.

The trophy didn’t fit in the suitcase because it was too big.

With scripting on, choose any word to see its links.

Legend: strong dependency · competing link · click / focus a token

The deeper cut
2017
Attention Is All You Need
Vaswani et al. threw out recurrence entirely: the model stacks self-attention and feed-forward layers, with no recurrent or convolutional layers.
Q · K · V
Queries, keys, values
Each token emits a query, scores it against every token's key, and pulls back a blend of values weighted by those scores.
parallel
No left-to-right wait
During training, all positions in a sequence are processed at once, so parallel matrix math replaces step-by-step recurrence. BERT and GPT build on this design.
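
The query, key, and value steps can be sketched in a few lines of plain Python. This is a minimal single-head sketch, not a reference implementation: the function names, the toy embeddings, and the identity projection matrices are all assumptions made for illustration. The division by the square root of the key dimension follows the scaled dot-product attention described in the 2017 paper.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def self_attention(embeddings, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a list of vectors."""
    queries = [matvec(w_q, x) for x in embeddings]
    keys = [matvec(w_k, x) for x in embeddings]
    values = [matvec(w_v, x) for x in embeddings]
    scale = math.sqrt(len(keys[0]))

    outputs = []
    # No iteration depends on another one, so a real implementation
    # replaces this loop with one batched matrix multiplication.
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        blended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(blended)
    return outputs

# Hypothetical toy inputs: three 2-d token embeddings, identity projections.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

outputs = self_attention(embeddings, identity, identity, identity)
for row in outputs:
    print([round(x, 3) for x in row])
```

The loop here runs one query at a time only because it is plain Python. Since each query's output uses the same keys and values and nothing from the other queries' results, the whole computation can run for every token at once, which is the contrast with a recurrent model that must finish one step before starting the next.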