Sliding window attention (SWA) addresses the quadratic complexity of standard self-attention by restricting each token to attend only to a fixed local window of neighbors, reducing complexity from O(n²) to O(n·w). This article explains the core mechanism, demonstrates a ~20x compute reduction with concrete numbers, and covers SWAT-style refinements (sigmoid attention, balanced ALiBi, RoPE), SWA's role in modern architectures, and its basic limitations.
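As a minimal sketch of where the savings come from, the following snippet (my own illustration, not from the article) builds a causal sliding-window mask and counts how many attention scores survive compared with full attention; the window size `w` and sequence length `n` are arbitrary example values:

```python
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean mask where token i attends to tokens in [max(0, i - w + 1), i]."""
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    return (j <= i) & (j > i - w)  # causal and within the local window

n, w = 4096, 256  # example values, not from the article
full_ops = n * n                               # scores in full attention
swa_ops = int(sliding_window_mask(n, w).sum()) # scores in windowed attention
print(f"compute reduction: ~{full_ops / swa_ops:.1f}x")
```

For n much larger than w the ratio approaches n/w, which is why the speedup grows with sequence length.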
Table of contents

- Introduction
- Key Takeaways
- How Traditional Attention Works
- Good to know concepts
- What is Sliding Window Attention?
- Complexity Comparison
- Understanding SWAT Attention in a Simple Way
  - Step 1: Replacing Softmax with Sigmoid
  - Step 2: Adding Positional Bias with Balanced ALiBi
  - Step 3: Adding RoPE for Stronger Position Encoding
  - Step 4: Efficiency of SWAT
- Sliding Window Attention in Modern Architectures
- What Improved Over Basic Sliding Window Attention
- Basic Limitations
- FAQ's
- Conclusion