NVIDIA Research introduces TTT-E2E (Test-Time Training End-to-End), a novel approach that enables LLMs to compress context into their weights during inference through next-token prediction. Unlike traditional transformers with full attention, whose per-token latency grows with context length, or RNNs, which lose accuracy at longer contexts, TTT-E2E achieves both constant inference latency and lower loss at extended context lengths. The method is 2.7x faster than full attention at 128K context and 35x faster at 2M context on H100 GPUs, while maintaining better accuracy than existing approaches. The technique uses meta-learning during training to prepare the model's initialization for test-time adaptation, effectively mimicking how humans compress experience into intuitive knowledge rather than maintaining perfect recall of every detail.
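To make the core idea concrete, here is a toy sketch of test-time training: a model's weights are updated at inference time by gradient steps on next-token prediction over the context, so the context is "compressed" into the weights themselves. This is an illustrative toy with a tiny bigram model, not NVIDIA's TTT-E2E implementation; all names and hyperparameters below are hypothetical.

```python
import numpy as np

VOCAB = 4

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ttt_adapt(W, context, lr=0.5, steps=3):
    """Update weights W (VOCAB x VOCAB bigram logits) by SGD on
    next-token prediction over the context -- the 'compression' step
    that TTT performs at inference time. Toy illustration only."""
    for _ in range(steps):
        for prev, nxt in zip(context[:-1], context[1:]):
            p = softmax(W[prev])
            grad = p.copy()
            grad[nxt] -= 1.0          # d(cross-entropy)/d(logits)
            W[prev] -= lr * grad
    return W

# A repetitive context following the pattern 0 -> 1 -> 2 -> 3 -> 0 ...
context = [0, 1, 2, 3] * 8

W = np.zeros((VOCAB, VOCAB))          # initialization before adaptation
W = ttt_adapt(W, context)

# After adaptation, the weights encode the context's pattern, so the
# model predicts the continuation without re-reading the context.
pred_after_0 = int(np.argmax(W[0]))   # -> 1
print(pred_after_0)
```

Because the adapted weights, not a growing attention cache, carry the context, prediction cost stays constant regardless of how long the context was; in TTT-E2E, meta-learning additionally shapes the initialization so that these test-time gradient steps are effective.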
Table of contents
How does LLM memory differ from human memory?
Our method: compressing context into weights
What will be the role of RAG?
Limitations
Conclusion