Linear attention and its variants have emerged as promising techniques for sequence modeling. Compared to standard softmax attention in Transformers, these models decode faster and need only constant memory, regardless of sequence length. Such methods may hold the key to unlocking long-context processing capability.

In this video, let's explore what comes after softmax attention.

00:00 Introduction
00:13 Softmax attention - Review
02:23 Softmax attention - Matrix form
03:29 KV caching
05:29 Linear attention
10:15 Chunkwise parallel training
14:41 Gating in linear attention
17:02 Test-time regression perspective
21:29 Delta update rule
23:51 Efficient training of DeltaNet
29:12 Better optimization for test-time regression
31:13 More expressive regressors

References:

[Linear Attention and Beyond] https://www.youtube.com/watch?v=d0HJvGSWw8A
(by Songlin Yang)
[Test-time Regression] https://www.youtube.com/watch?v=C7KnW8VFp4U
(by Alex Wang)
[Beyond Standard LLMs] https://magazine.sebastianraschka.com/p/beyond-standard-llms
(by Sebastian Raschka)

[Linear Attention] https://arxiv.org/abs/2006.16236
[Chunkwise parallel training] https://arxiv.org/abs/2202.10447

[Gated Linear Attention] https://arxiv.org/abs/2312.06635
[Lightning Attention] https://arxiv.org/abs/2405.17381
[Mamba 2] https://arxiv.org/abs/2405.21060

[Test-time Regression] https://arxiv.org/abs/2501.12352
[Fast Weight Programmer] https://proceedings.mlr.press/v139/schlag21a

[Gated Delta Networks] https://arxiv.org/abs/2412.06464
[Qwen-Next] https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd
[RWKV-6] https://arxiv.org/abs/2404.05892
[RWKV-7] https://arxiv.org/abs/2503.14456
[Kimi-Linear] https://arxiv.org/abs/2510.26692

[DeltaProduct] https://arxiv.org/abs/2502.10297
[Longhorn] https://arxiv.org/abs/2407.14207
[Mesa layer] https://arxiv.org/abs/2309.05858
[MesaNet] https://arxiv.org/abs/2506.05233

[Test Time Training]
[Titans] https://arxiv.org/abs/2501.00663
[Test Time Training Done Right] https://arxiv.org/abs/2505.23884
[TTT-E2E] https://arxiv.org/abs/2512.23675

Video made with manim: https://www.manim.community/

Jia-Bin Huang

A deep technical exploration of attention mechanisms in generative AI, starting from standard softmax attention and KV caching, then progressing to linear attention as a way to achieve constant memory complexity. Covers the recurrent formulation of linear attention, chunkwise parallel training for hardware efficiency, the delta update rule (framed as online SGD on a regression objective), gating mechanisms for recency bias, and advanced extensions like test-time training (TTT), Titans, DeltaNet, and Longhorn. Explains how linear attention can be viewed as training a neural memory module that compresses key-value associations into a state matrix rather than explicitly caching them.
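
To make the "state matrix instead of a KV cache" idea concrete, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions, not code from the video or from any of the referenced papers: the feature map and the normalizer of linear attention are left out, and the function names, the (T, d) array layout, and the per-step learning rate `beta` are choices made only for this example.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Unnormalized causal linear attention, run as a recurrence.

    The state S has shape (d_k, d_v) and does not grow with sequence length.
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S = S + np.outer(K[t], V[t])  # write the association k_t -> v_t
        out[t] = Q[t] @ S             # read the memory with the query
    return out

def linear_attention_parallel(Q, K, V):
    """Same result in matrix form: causal mask on Q K^T, with no softmax."""
    T = Q.shape[0]
    mask = np.tril(np.ones((T, T)))
    return (mask * (Q @ K.T)) @ V

def delta_rule_recurrent(Q, K, V, beta):
    """Delta update rule as one online SGD step per token.

    Each step descends the loss 0.5 * ||S^T k_t - v_t||^2 with step size
    beta[t] (an assumed, externally supplied learning rate).
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        pred = K[t] @ S                                # what the memory currently returns for k_t
        S = S + beta[t] * np.outer(K[t], V[t] - pred)  # correct the memory by the prediction error
        out[t] = Q[t] @ S
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((6, 4))
    K = rng.standard_normal((6, 4))
    V = rng.standard_normal((6, 3))
    same = np.allclose(linear_attention_recurrent(Q, K, V),
                       linear_attention_parallel(Q, K, V))
    print("recurrent matches parallel form:", same)
```

Under these assumptions the recurrent and matrix forms give the same outputs, which is what makes parallel training and constant-memory decoding two views of one model. With unit-norm keys and beta = 1, the delta rule overwrites whatever was stored for the current key, so reading back with that same key right after the write returns the new value exactly.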

Beyond Softmax: The Future of Attention Mechanisms