This post introduces Infinite-LLM, an efficient serving system for long-context LLMs. It combines a novel distributed attention algorithm, DistAttention, with a distributed KV cache management system, DistKV-LLM. Together, these components improve end-to-end throughput and support significantly longer context lengths than existing systems, addressing the challenges LLM services face in cloud environments and paving the way for more robust and scalable LLM cloud serving.
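To make the core idea concrete, here is a minimal NumPy sketch of the general block-wise attention decomposition that a distributed algorithm like DistAttention builds on: each segment of the KV cache is attended to independently (e.g., on a different instance), and the partial results are merged with log-sum-exp rescaling. The function names, shapes, and two-way split are illustrative assumptions, not the paper's actual API or implementation.

```python
import numpy as np

def partial_attention(q, k_seg, v_seg):
    """Attention over one KV-cache segment; returns an un-normalized
    partial output plus the softmax statistics needed to merge segments."""
    scores = q @ k_seg.T / np.sqrt(q.shape[-1])  # (1, seg_len)
    m = scores.max()                             # per-segment max, for stability
    p = np.exp(scores - m)
    s = p.sum()                                  # un-normalized softmax mass
    o = p @ v_seg                                # un-normalized partial output
    return o, m, s

def merge(partials):
    """Combine per-segment results with log-sum-exp rescaling so the
    merged output equals attention over the full KV cache."""
    m_global = max(m for _, m, _ in partials)
    s_total, o_total = 0.0, 0.0
    for o, m, s in partials:
        scale = np.exp(m - m_global)             # rescale to the global max
        s_total += s * scale
        o_total = o_total + o * scale
    return o_total / s_total

# Toy check: splitting the KV cache across two "instances" matches
# attending over the full cache at once.
rng = np.random.default_rng(0)
d, n = 8, 16
q = rng.normal(size=(1, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))

split = n // 2
merged = merge([partial_attention(q, k[:split], v[:split]),
                partial_attention(q, k[split:], v[split:])])

scores = q @ k.T / np.sqrt(d)
weights = np.exp(scores - scores.max())
ref = (weights / weights.sum()) @ v
assert np.allclose(merged, ref)
```

Because the merge step only needs each segment's partial output and two scalars, KV cache segments can live on different instances and be aggregated cheaply, which is what makes this decomposition attractive for distributed serving.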