How the KV cache gives every AI conversation a physical weight in silicon, and what happens when the memory runs out.

A deep technical and philosophical exploration of the KV cache in large language models — how it works, how it has evolved across architectures (GPT-2, Llama 3, DeepSeek V3, Gemma 3), and what its limitations mean for AI memory. Covers the physical cost of conversation state in GPU memory, cache eviction and prompt caching pricing, context rot in long conversations, the compaction problem, and external memory workarounds. Closes with a sci-fi reflection on Greg Egan's Diaspora and the trajectory toward AI systems that manage their own memory.

The Weight of Remembering