Transformers lack inherent positional awareness because attention is permutation-equivariant: reordering the input tokens simply reorders the outputs. Sinusoidal absolute positional encoding was the original fix, but relative positional encoding via RoPE (Rotary Position Embedding) offers a more robust solution. RoPE rotates each query and key vector by an angle proportional to its token position, so the query-key dot product depends on position only through the relative distance between the two tokens. Higher-dimensional vectors are handled by partitioning them into 2D pairs, each rotated at its own frequency. High-frequency components enable position-specific attention patterns, while low-frequency components support long-range semantic attention. Models like LLaMA and Gemma use RoPE. When the inference context exceeds the training window, techniques like Position Interpolation and NTK-based frequency scaling help extend the context length, with NTK-based methods generalizing well beyond their fine-tuning window, up to 64K tokens.
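The rotation and its relative-distance property can be illustrated with a minimal sketch in plain Python. It is not taken from any particular model's code: the function name `rope_rotate`, the pairing of adjacent dimensions, the base of 10000, and the `scale` parameter (a simple stand-in for Position Interpolation) are all assumptions made for illustration.

```python
import math

def rope_rotate(vec, position, base=10000.0, scale=1.0):
    """Rotate each adjacent 2D pair of `vec` by a position-dependent angle.

    Assumptions for this sketch: pairs are (0,1), (2,3), ...; pair i uses
    frequency base**(-2i/dim); `scale` < 1 compresses positions, which is
    one simple way to express Position Interpolation.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)      # lower i means higher frequency
        angle = position * scale * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # standard 2D rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3), very different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

# Position Interpolation sketch: with scale 0.5, position 4096 is rotated
# exactly as position 2048 would be without scaling.
interpolated = rope_rotate(q, 4096, scale=0.5)
unscaled = rope_rotate(q, 2048)
```

Here `score_near` and `score_far` agree up to floating-point error, since both pairs of tokens are three positions apart. The `scale` argument shows why interpolation keeps rotation angles inside the range seen during training: positions beyond the training window are mapped back onto angles the model has already encountered.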

13m watch time
