Cloudflare details the engineering behind running trillion-parameter LLMs like Kimi K2.5 on Workers AI. Key optimizations include prefill-decode disaggregation (achieving a 3x improvement in inter-token latency), session-affinity-based prompt caching (boosting cache hit ratios from 60% to 80%), the Mooncake Transfer Engine for cross-GPU KV cache sharing via RDMA, and NVIDIA EAGLE-3 speculative decoding for faster tool-call generation. Their proprietary Rust-based inference engine, Infire, supports multi-GPU tensor and pipeline parallelism, can boot Kimi K2.5 on 8 H100s in under 20 seconds, and delivers up to 20% higher throughput than vLLM with significantly lower memory overhead.
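To illustrate the idea behind session-affinity-based prompt caching, here is a minimal Rust sketch: requests carrying the same session identifier are hashed to the same GPU node, so later turns of a conversation land where the KV cache from earlier turns already lives. This is not Cloudflare's implementation; the `Node` type, `route_by_session` function, and simple modulo hashing are illustrative assumptions, and a production router would use consistent hashing and load-aware fallback.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A GPU-backed inference node that may already hold KV-cache entries
/// for previously seen sessions. (Hypothetical type for illustration.)
struct Node {
    id: &'static str,
}

/// Pick a node by hashing the session identifier, so every turn of the
/// same conversation is routed to the same node and can reuse the prompt
/// (KV) cache built during earlier turns.
fn route_by_session(session_id: &str, nodes: &[Node]) -> usize {
    let mut hasher = DefaultHasher::new();
    session_id.hash(&mut hasher);
    (hasher.finish() as usize) % nodes.len()
}

fn main() {
    let nodes = vec![
        Node { id: "gpu-a" },
        Node { id: "gpu-b" },
        Node { id: "gpu-c" },
    ];

    // Two turns of the same session land on the same node (cache hit);
    // a different session may be routed elsewhere.
    let turn_1 = route_by_session("session-1234", &nodes);
    let turn_2 = route_by_session("session-1234", &nodes);
    let other = route_by_session("session-9999", &nodes);

    assert_eq!(turn_1, turn_2);
    println!(
        "session-1234 -> {}, session-9999 -> {}",
        nodes[turn_1].id, nodes[other].id
    );
}
```

The point of the affinity is simply that a cache hit skips recomputing the prefill for the shared prefix; the cross-GPU KV sharing via the Mooncake Transfer Engine covers the cases where affinity alone is not enough.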