Transformer Self-Attention Visualization (DistilBERT)

Understanding Transformer Self-Attention

  • Rows = Query token (the token doing the 'looking').
  • Columns = Key token (the token being 'looked at').
  • Darker color = stronger attention weight; the sketch after this list shows how these weights are computed.
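
These weights come from scaled dot-product attention: each query vector is scored against every key vector, and the scores are softmax-normalized per row. The following is a minimal NumPy sketch using toy random vectors, not DistilBERT's real learned projections, just to show how a row-stochastic weight matrix like the one in the heatmap arises.

```python
import numpy as np

def attention_weights(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Row i is query token i's distribution over all key tokens (each row sums to 1)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax

# Toy example: 4 tokens with 8-dimensional query/key vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
W = attention_weights(Q, K)
print(W.round(2))      # rows = query tokens, columns = key tokens
print(W.sum(axis=-1))  # every row sums to 1.0
```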

Unlike recurrent models, which pass information along the sequence one step at a time, Transformers process all tokens in parallel, and self-attention lets any token attend directly to any other token in the sentence. Because any pair of positions is connected in a single step, the model captures long-distance relationships more easily.
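
To produce a heatmap like the one described above, the per-layer attention tensors can be read out of Hugging Face's DistilBERT by requesting attentions in the output. The sketch below assumes the distilbert-base-uncased checkpoint, the transformers and matplotlib libraries, and an arbitrary choice of layer 0, head 0; the sentence is only an example.

```python
import matplotlib.pyplot as plt
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

sentence = "The animal didn't cross the street because it was too tired."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq_len, seq_len) tensor per layer.
layer, head = 0, 0                                # arbitrary choice to visualize
attn = outputs.attentions[layer][0, head].numpy() # (seq_len, seq_len) weight matrix
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

fig, ax = plt.subplots(figsize=(6, 6))
ax.imshow(attn, cmap="Blues")                     # darker = stronger attention
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=90)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("Key token")
ax.set_ylabel("Query token")
ax.set_title(f"DistilBERT attention, layer {layer}, head {head}")
plt.tight_layout()
plt.show()
```

DistilBERT-base has 6 layers with 12 heads each, and different heads often pick out different relationships, so it is worth sweeping the layer and head indices when exploring.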
