A deep technical exploration of attention mechanisms in generative AI, starting from standard softmax attention and KV caching, then progressing to linear attention as a way to keep inference memory constant in sequence length. Covers the recurrent formulation of linear attention, chunkwise parallel training for hardware efficiency, the delta update rule (framed as online SGD on a regression objective), gating mechanisms for recency bias, and advanced extensions such as test-time training (TTT), Titans, DeltaNet, and Longhorn. Explains how linear attention can be viewed as training a neural memory module that compresses key-value associations into a fixed-size state matrix rather than explicitly caching them.
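As a rough illustration of the ideas summarized above, here is a minimal NumPy sketch of the recurrent form of linear attention and of a single delta-rule update. It is not taken from the video: the function names are hypothetical, and it assumes no feature map, no normalization term, and no gating. The state `S` has a fixed shape `(d_k, d_v)` regardless of sequence length, which is the contrast with a KV cache that grows with every token.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention as a recurrence (unnormalized, assumed form):
    S_t = S_{t-1} + k_t v_t^T,  o_t = S_t^T q_t."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))          # fixed-size state, independent of T
    out = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(K[t], V[t])  # write the (key, value) association
        out[t] = Q[t] @ S             # read with the current query
    return out, S

def linear_attention_parallel(Q, K, V):
    """Same computation in parallel form: (QK^T masked causally) V."""
    T = Q.shape[0]
    causal_mask = np.tril(np.ones((T, T)))
    return ((Q @ K.T) * causal_mask) @ V

def delta_rule_step(S, k, v, beta):
    """One SGD step with learning rate beta on the regression loss
    0.5 * ||S^T k - v||^2, whose gradient w.r.t. S is k (S^T k - v)^T."""
    prediction = k @ S                # S^T k, the value currently stored for k
    return S - beta * np.outer(k, prediction - v)
```

The recurrent and parallel forms produce the same outputs; the recurrent one is what makes constant-memory decoding possible, while the parallel one is the starting point for chunkwise training. In the delta-rule step, a unit-norm key with `beta = 1` overwrites the stored value for that key exactly, rather than adding to it.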

34m watch time
