Exclusive Self Attention (XSA) proposes a simple yet effective fix to the standard attention mechanism. With only two lines of code, XSA improves overall model performance without incurring significant compute or memory costs.

00:00 Introduction
00:18 Attention layer
01:46 Feedforward network
03:05 Why can't we just ignore the self-value vector?
04:41 Attention similarity bias
05:38 Visualization of orthogonalization
07:50 Implementation of XSA
09:09 Performance 
10:25 Robustness to hyperparameters
11:52 Summary

Reference: 
- Exclusive Self Attention https://arxiv.org/abs/2603.09078

Video made with manim: https://www.manim.community/

Jia-Bin Huang

Exclusive Self-Attention (XSA) is a two-line code modification to standard transformer attention that addresses 'attention similarity bias': the tendency for attention outputs to align with each token's own value vector rather than gathering contextual information from other tokens. By applying orthogonal projection to remove the self-value vector component from the attention output, XSA forces attention to focus purely on context from other tokens. Experiments across 0.7B to 2.7B parameter models show consistent improvements in training/validation loss and downstream benchmarks, with gains growing at longer sequence lengths and larger model sizes. The change requires no modifications to query, key, or value matrices and introduces minimal computational overhead.
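
The projection step described above can be sketched in a few lines of NumPy. This is an illustrative sketch and not the paper's reference code: it assumes a single attention head, a causal mask, and inputs of shape (tokens, dim). The function names and the eps stabilizer are hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    # Standard scaled dot-product attention with a causal mask (assumed here).
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def exclusive_self_attention(q, k, v, eps=1e-8):
    # q, k, v are left untouched; only the attention output is modified.
    y = causal_attention(q, k, v)
    # The two extra lines: normalize each token's own value vector, then
    # subtract the component of the output that lies along it.
    v_hat = v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)
    return y - (y * v_hat).sum(axis=-1, keepdims=True) * v_hat

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 8)) for _ in range(3))
z = exclusive_self_attention(q, k, v)
# Per-token dot product between the output and the token's own value vector.
print(np.abs((z * v).sum(axis=-1)).max())
```

The printed value should be close to zero, meaning each token's output is orthogonal to its own value vector. In this sketch the first token can only attend to itself under the causal mask, so its output is projected to approximately zero.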

We’ve Been Doing Attention Wrong (2-Line Fix)