This post introduces Infinite-LLM, an efficient serving system for long-context LLMs. It combines a novel distributed attention algorithm, DistAttention, with a distributed KVCache management system, DistKV-LLM. Together, these improve end-to-end throughput and support significantly longer context lengths than existing LLM serving systems.
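The post doesn't reproduce the paper's internals, but the general idea behind distributing attention over a partitioned KV cache can be sketched: each worker computes attention statistics over its local KV block, and a log-sum-exp merge recovers the exact global softmax result. The NumPy sketch below illustrates this general technique only; it is not the authors' implementation, and the function names are hypothetical.

```python
import numpy as np

def partial_attention(q, k_block, v_block):
    """Attention statistics for one local KV block:
    (max score, exp-sum, unnormalized weighted sum of values)."""
    scores = q @ k_block.T / np.sqrt(q.shape[-1])   # (block_len,)
    m = scores.max()
    w = np.exp(scores - m)
    return m, w.sum(), w @ v_block                  # scalar, scalar, (d,)

def merge_partials(partials):
    """Combine per-block statistics into the exact global softmax
    attention output via a log-sum-exp merge."""
    m_global = max(m for m, _, _ in partials)
    s_global = sum(s * np.exp(m - m_global) for m, s, _ in partials)
    o_global = sum(o * np.exp(m - m_global) for m, _, o in partials)
    return o_global / s_global

# Toy check: attention over a KV cache split into blocks matches
# monolithic attention over the full cache.
rng = np.random.default_rng(0)
d, n = 64, 256
q = rng.normal(size=d)
k, v = rng.normal(size=(n, d)), rng.normal(size=(n, d))

blocks = [partial_attention(q, k[i:i + 64], v[i:i + 64]) for i in range(0, n, 64)]
distributed = merge_partials(blocks)

scores = q @ k.T / np.sqrt(d)
w_full = np.exp(scores - scores.max())
reference = (w_full / w_full.sum()) @ v
assert np.allclose(distributed, reference)
```

Because each block contributes only three small quantities, the blocks can live on different GPUs or hosts and be merged cheaply, which is what makes spreading the KV cache across a cluster practical.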