NVIDIA Research introduces TTT-E2E (Test-Time Training End-to-End), a novel approach that enables LLMs to compress context into their weights during inference through next-token prediction. Unlike traditional transformers with full attention, whose per-token latency grows with context length, or RNNs, which lose accuracy at longer contexts, TTT-E2E achieves both constant inference latency and lower loss at extended context lengths. The method is 2.7x faster than full attention at 128K context and 35x faster at 2M context on H100 GPUs, while maintaining better accuracy than existing approaches. The technique uses meta-learning during training to prepare the model's initialization for test-time adaptation, effectively mimicking how humans compress experience into intuitive knowledge rather than maintaining perfect recall of every detail.
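To make the core idea concrete, here is a toy sketch of test-time training: a model's weights are updated at inference time by gradient steps on next-token prediction over the context, so the context is "compressed" into the weights themselves. This is an illustrative toy with a tiny bigram model, not NVIDIA's TTT-E2E implementation; all names and hyperparameters below are hypothetical.

```python
import numpy as np

VOCAB = 4

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ttt_adapt(W, context, lr=0.5, steps=3):
    """Update weights W (VOCAB x VOCAB bigram logits) by SGD on
    next-token prediction over the context -- the 'compression' step
    that TTT performs at inference time. Toy illustration only."""
    for _ in range(steps):
        for prev, nxt in zip(context[:-1], context[1:]):
            p = softmax(W[prev])
            grad = p.copy()
            grad[nxt] -= 1.0          # d(cross-entropy)/d(logits)
            W[prev] -= lr * grad
    return W

# A repetitive context following the pattern 0 -> 1 -> 2 -> 3 -> 0 ...
context = [0, 1, 2, 3] * 8

W = np.zeros((VOCAB, VOCAB))          # initialization before adaptation
W = ttt_adapt(W, context)

# After adaptation, the weights encode the context's pattern, so the
# model predicts the continuation without re-reading the context.
pred_after_0 = int(np.argmax(W[0]))   # -> 1
print(pred_after_0)
```

Because the adapted weights, not a growing attention cache, carry the context, prediction cost stays constant regardless of how long the context was; in TTT-E2E, meta-learning additionally shapes the initialization so that these test-time gradient steps are effective.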
Table of contents
How does LLM memory differ from human memory?
Our method: compressing context into weights
What will be the role of RAG?
Limitations
Conclusion