NVIDIA Research introduces TTT-E2E (Test-Time Training End-to-End), a novel approach that enables LLMs to compress context into their weights during inference through next-token prediction. Unlike traditional transformers with full attention, whose latency scales poorly with context length, or RNNs, which lose accuracy on longer contexts, TTT-E2E aims to combine long-context accuracy with low, near-constant inference cost.
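To make the core idea concrete, here is a minimal sketch of test-time training on next-token prediction. It uses a toy bigram model rather than an LLM, and the model, sizes, and update rule are illustrative assumptions, not NVIDIA's actual implementation: the point is only that gradient steps on the context's next-token loss move information from the context into the weights.

```python
# Toy sketch of the test-time-training idea: "compress" a context
# sequence into model weights by taking SGD steps on next-token
# prediction loss during inference. (Illustrative assumption, not
# NVIDIA's TTT-E2E implementation.)
import numpy as np

rng = np.random.default_rng(0)
V = 8                                    # toy vocabulary size
W = rng.normal(scale=0.1, size=(V, V))   # bigram logits: W[prev, next]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(W, tokens):
    # Average next-token negative log-likelihood under the bigram model.
    probs = softmax(W[tokens[:-1]])
    return -np.mean(np.log(probs[np.arange(len(tokens) - 1), tokens[1:]]))

def ttt_step(W, tokens, lr=0.5):
    # One SGD step on next-token prediction over the context; for
    # softmax cross-entropy, dL/dlogits = probs - one_hot(target).
    probs = softmax(W[tokens[:-1]])
    grad = np.zeros_like(W)
    for i, (prev, nxt) in enumerate(zip(tokens[:-1], tokens[1:])):
        g = probs[i].copy()
        g[nxt] -= 1.0
        grad[prev] += g
    return W - lr * grad / (len(tokens) - 1)

# A repetitive "context" the model has never seen.
context = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2])
before = nll(W, context)
for _ in range(50):        # test-time training loop
    W = ttt_step(W, context)
after = nll(W, context)
print(f"context NLL before TTT: {before:.3f}, after: {after:.3f}")
```

After the loop, the context's pattern lives in `W` itself and the tokens can be discarded, which is what lets this family of methods avoid attending over the full context at generation time.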
Table of contents
- How does LLM memory differ from human memory?
- Our method: compressing context into weights
- What will be the role of RAG?
- Limitations
- Conclusion