Why can’t LLMs just LEARN the context window?
A research paper called 'End-to-End Test-Time Training for Long Context' proposes storing context information directly in a transformer's MLP weights via gradient updates at inference time, rather than relying on full attention or a growing KV cache. The approach combines sliding-window attention for local context with test-time weight updates for long-range context, and batching the updates keeps the method computationally feasible. Experiments show it tracks full-attention performance closely at 32K–128K token contexts and maintains a consistent loss advantage. However, precise needle-in-a-haystack retrieval remains a weakness relative to full attention, since information is compressed into the weights without any index over individual tokens. The method is presented as a promising new direction toward effectively infinite context windows through continual learning.
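The core mechanism can be illustrated with a minimal sketch (not the paper's implementation; dimensions, learning rate, and the key/value setup are illustrative assumptions): a weight matrix plays the role of the test-time-trained MLP, and each chunk of the context stream produces one batched gradient step that compresses that chunk into the weights, rather than appending it to a KV cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden dimension (illustrative)

# "Fast weights": a layer updated by gradient descent at inference time.
W = np.zeros((d, d))

def ttt_update(W, K, V, lr=0.1):
    """One batched gradient step on a chunk of (key, value) pairs.
    Self-supervised loss: mean squared error ||K @ W.T - V||^2."""
    pred = K @ W.T
    grad = (pred - V).T @ K / len(K)  # dL/dW, averaged over the chunk
    return W - lr * grad

def ttt_read(W, q):
    """Retrieve from the compressed context with a single matmul."""
    return W @ q

# Stream a long context in chunks; batching updates per chunk keeps
# the cost per token constant instead of growing with context length.
K_all = rng.standard_normal((256, d))
V_all = K_all @ rng.standard_normal((d, d)).T  # association to be stored
mse_before = np.mean((K_all @ W.T - V_all) ** 2)
for start in range(0, 256, 32):
    chunk_K, chunk_V = K_all[start:start + 32], V_all[start:start + 32]
    for _ in range(20):  # a few inner gradient steps per chunk
        W = ttt_update(W, chunk_K, chunk_V)
mse_after = np.mean((K_all @ W.T - V_all) ** 2)
```

After the stream is consumed, `mse_after` is far below `mse_before`: the associations have been absorbed into `W`. This also makes the retrieval weakness concrete: a query recovers only what the lossy compression preserved, with no per-token index to fall back on, unlike attention over an explicit KV cache.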