We’ve Been Doing Attention Wrong (2-Line Fix)
Exclusive Self-Attention (XSA) is a two-line code modification to standard transformer attention that addresses 'attention similarity bias' — the tendency for attention outputs to align with each token's own value vector rather than gathering contextual information from other tokens. By applying orthogonal projection to remove the self-value vector component from the attention output, XSA forces attention to focus purely on context from other tokens. Experiments across 0.7B to 2.7B parameter models show consistent improvements in training/validation loss and downstream benchmarks, with gains growing at longer sequence lengths and larger model sizes. The change requires no modifications to query, key, or value matrices and introduces minimal computational overhead.
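The projection step described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: the single-head setup, the causal mask, the 1/sqrt(d) scaling, and the names `causal_attention` and `exclusive` are all assumptions made for the example. The two lines under `if exclusive:` are the part that corresponds to removing the self-value component.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V, exclusive=False):
    """Single-head causal attention over lists of vectors (illustrative).

    With exclusive=True, the component of each output that lies along the
    token's own value vector is projected out, as the summary describes.
    """
    d = len(Q[0])
    outputs = []
    for i, q in enumerate(Q):
        # Assumed causal mask: token i attends to positions 0..i only.
        weights = softmax([dot(q, K[j]) / math.sqrt(d) for j in range(i + 1)])
        # Standard attention output: weighted sum of value vectors.
        y = [sum(w * V[j][c] for j, w in enumerate(weights))
             for c in range(len(V[0]))]
        if exclusive:
            # Orthogonal projection: subtract the part of y parallel to V[i].
            coeff = dot(y, V[i]) / (dot(V[i], V[i]) + 1e-12)
            y = [yc - coeff * vc for yc, vc in zip(y, V[i])]
        outputs.append(y)
    return outputs
```

Calling `causal_attention(Q, K, V, exclusive=True)` returns outputs whose dot product with the corresponding `V[i]` is zero up to floating-point error, while queries, keys, and values are left untouched. Under the causal mask assumed here, the first token can only attend to itself, so its exclusive output is approximately the zero vector.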