Sliding window attention (SWA) addresses the quadratic complexity of standard self-attention by restricting each token to attend only to a fixed local window of neighbors, reducing complexity from O(n²) to O(n·w). The article explains the core mechanism, demonstrates a ~20x compute reduction with concrete numbers, and covers the SWAT refinements (sigmoid attention, balanced ALiBi, and RoPE), SWA's role in modern architectures, and its basic limitations.
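The local-window restriction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's implementation: each query attends only to the `w` most recent positions (a causal local window), so the attention matrix has O(n·w) nonzero entries instead of O(n²).

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Naive causal sliding-window attention over (n, d) arrays.

    Each query position i attends only to key positions j with
    i - window < j <= i, so only O(n * window) scores survive masking.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # (n, n) raw dot-product scores
    idx = np.arange(n)
    # allowed[i, j] is True iff j is inside i's causal local window.
    allowed = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(allowed, scores, -np.inf)
    # Numerically stable softmax over each row's surviving scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d, w = 8, 4, 3
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = sliding_window_attention(q, k, v, w)
```

Because the first token's window contains only itself, its output is exactly `v[0]`; later tokens mix at most `w` value vectors. A practical implementation would avoid materializing the full (n, n) score matrix, which is where the actual O(n·w) savings come from.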

11m read time · From digitalocean.com
Table of contents
- Introduction
- Key Takeaways
- How Traditional Attention Works
- Good to know concepts
- What is Sliding Window Attention?
- Complexity Comparison
- Understanding SWAT Attention in a Simple Way
- Step 1: Replacing Softmax with Sigmoid
- Step 2: Adding Positional Bias with Balanced ALiBi
- Step 3: Adding RoPE for Stronger Position Encoding
- Step 4: Efficiency of SWAT
- Sliding Window Attention in Modern Architectures
- What Improved Over Basic Sliding Window Attention
- Basic Limitations
- FAQ’s
- Conclusion
