LLM’s Billion Dollar Problem


Token consumption in LLMs has exploded with the rise of thinking models and AI agents, creating scalability challenges. Standard attention scales quadratically with context length, making long contexts prohibitively expensive. Three approaches attempt to solve this: sparse attention (restricts which tokens interact), linear attention (accumulates information into a fixed-size shared memory), and compressed attention (compresses tokens before comparison). While sparse and compressed attention help, only linear attention can, in principle, scale past 1M-token context windows. Recent hybrid approaches that combine linear attention with standard or compressed attention are showing promising results, with Google's Gemini 3 Flash demonstrating breakthrough performance at 1M context length.
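To make the scaling difference concrete, here is a minimal NumPy sketch (not from the original piece) contrasting the two: standard attention builds an n×n score matrix, so compute and memory grow quadratically with sequence length, while a linear-attention-style formulation with a kernel feature map `phi` accumulates keys and values into a fixed-size state, so cost grows linearly. The function names and the ReLU-based feature map are illustrative assumptions, not any specific model's implementation.

```python
import numpy as np

def standard_attention(Q, K, V):
    # Standard softmax attention: the (n x n) score matrix makes cost and
    # memory grow quadratically with sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                 # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                       # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Linear-attention-style sketch (assumed feature map phi): keys and
    # values are accumulated into a fixed-size (d x d) state, so cost is
    # linear in n and the memory footprint is independent of context length.
    Qf, Kf = phi(Q), phi(K)
    state = Kf.T @ V                                         # (d, d) running memory
    norm = Kf.sum(axis=0)                                    # (d,) normalizer
    return (Qf @ state) / (Qf @ norm)[:, None]               # (n, d)

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
print(standard_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

Both functions return an (n, d) output, but only the quadratic version ever materializes an n×n matrix, which is the bottleneck the sparse, linear, and compressed variants each try to avoid in different ways.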

