Why can’t LLMs just LEARN the context window?


A research paper called "End-to-End Test-Time Training for Long Context" proposes storing context information directly in a transformer's MLP weights via gradient updates at inference time, rather than relying on full attention or a growing KV cache. The approach combines sliding-window attention for local context with test-time weight updates for long-range context, and batching those gradient updates keeps the method computationally feasible. Experiments show it tracks full-attention performance closely at 32K–128K token contexts and maintains a consistent loss advantage. However, precise needle-in-a-haystack retrieval remains a weakness compared to full attention, since information is compressed into the weights with no way to index back into individual tokens. The method is presented as a promising new direction toward effectively infinite context windows through continual learning.
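To make the mechanism concrete, here is a minimal PyTorch sketch of the core idea: streaming a long context through in chunks and compressing it into MLP weights via gradient steps at inference time. The self-supervised objective (predicting each token's features from the previous token's) and all sizes are illustrative assumptions, not the paper's exact recipe, and the sliding-window attention that would handle local detail is omitted.

```python
# Toy sketch of test-time training (TTT) for long context.
# Assumption: a next-step feature-reconstruction loss stands in for
# whatever self-supervised objective the paper actually uses.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64  # hidden size (illustrative)

# MLP weights that will absorb long-range context via gradient
# updates at inference time, instead of a growing KV cache.
W1 = (0.02 * torch.randn(d, 4 * d)).requires_grad_()
W2 = (0.02 * torch.randn(4 * d, d)).requires_grad_()
opt = torch.optim.SGD([W1, W2], lr=1e-2)

def mlp(x: torch.Tensor) -> torch.Tensor:
    return F.gelu(x @ W1) @ W2

def ttt_step(chunk: torch.Tensor) -> float:
    """One test-time update: write this chunk of hidden states into the
    MLP weights by descending a self-supervised reconstruction loss."""
    inputs, targets = chunk[:-1], chunk[1:]
    loss = F.mse_loss(mlp(inputs), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Stream a long "context" (random features here) through in chunks.
# Local detail would come from sliding-window attention (not shown);
# long-range information ends up compressed into W1/W2.
long_context = torch.randn(4096, d)
for chunk in long_context.split(256):
    ttt_step(chunk)

# At generation time, mlp(hidden_state) now carries lossy, compressed
# information about the full streamed context -- which is also why
# exact needle-in-a-haystack retrieval is hard: nothing here indexes
# back to a specific token the way attention over a KV cache does.
```

For clarity this sketch processes chunks sequentially; per the summary, the actual method batches these updates to make test-time training computationally practical.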
