Why can’t LLMs just LEARN the context window?


A research paper called "End-to-End Test-Time Training for Long Context" proposes storing context information directly in a transformer's MLP weights via gradient updates at inference time, rather than relying on full attention or a growing KV cache. The approach combines sliding-window attention for local context with test-time weight updates for long-range context, and batching those gradient updates keeps the method computationally feasible. Experiments show it tracks full-attention performance closely at 32K–128K token contexts and maintains a consistent loss advantage. However, precise needle-in-a-haystack retrieval remains a weakness compared to full attention, since information is compressed into the weights with no way to index back into individual tokens. The method is presented as a promising new direction toward effectively infinite context windows through continual learning.
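To make the mechanism concrete, here is a minimal PyTorch sketch of the core idea: streaming a long context through in chunks and compressing it into MLP weights via gradient steps at inference time. The self-supervised objective (predicting each token's features from the previous token's) and all sizes are illustrative assumptions, not the paper's exact recipe, and the sliding-window attention that would handle local detail is omitted.

```python
# Toy sketch of test-time training (TTT) for long context.
# Assumption: a next-step feature-reconstruction loss stands in for
# whatever self-supervised objective the paper actually uses.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64  # hidden size (illustrative)

# MLP weights that will absorb long-range context via gradient
# updates at inference time, instead of a growing KV cache.
W1 = (0.02 * torch.randn(d, 4 * d)).requires_grad_()
W2 = (0.02 * torch.randn(4 * d, d)).requires_grad_()
opt = torch.optim.SGD([W1, W2], lr=1e-2)

def mlp(x: torch.Tensor) -> torch.Tensor:
    return F.gelu(x @ W1) @ W2

def ttt_step(chunk: torch.Tensor) -> float:
    """One test-time update: write this chunk of hidden states into the
    MLP weights by descending a self-supervised reconstruction loss."""
    inputs, targets = chunk[:-1], chunk[1:]
    loss = F.mse_loss(mlp(inputs), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Stream a long "context" (random features here) through in chunks.
# Local detail would come from sliding-window attention (not shown);
# long-range information ends up compressed into W1/W2.
long_context = torch.randn(4096, d)
for chunk in long_context.split(256):
    ttt_step(chunk)

# At generation time, mlp(hidden_state) now carries lossy, compressed
# information about the full streamed context -- which is also why
# exact needle-in-a-haystack retrieval is hard: nothing here indexes
# back to a specific token the way attention over a KV cache does.
```

For clarity this sketch processes chunks sequentially; per the summary, the actual method batches these updates to make test-time training computationally practical.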
