Cloudflare details the engineering behind running trillion-parameter LLMs like Kimi K2.5 on Workers AI. Key optimizations include prefill-decode disaggregation (achieving a 3x improvement in inter-token latency), session-affinity-based prompt caching (boosting cache hit ratios from 60% to 80%), the Mooncake Transfer Engine for cross-GPU KV cache sharing via RDMA, and NVIDIA EAGLE-3 speculative decoding for faster tool-call generation. Their proprietary Rust-based inference engine, Infire, supports multi-GPU tensor and pipeline parallelism, can boot Kimi K2.5 on 8 H100s in under 20 seconds, and delivers up to 20% higher throughput than vLLM with significantly lower memory overhead.
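To illustrate the idea behind session-affinity-based prompt caching, here is a minimal Rust sketch: requests carrying the same session identifier are hashed to the same GPU node, so later turns of a conversation land where the KV cache from earlier turns already lives. This is not Cloudflare's implementation; the `Node` type, `route_by_session` function, and simple modulo hashing are illustrative assumptions, and a production router would use consistent hashing and load-aware fallback.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A GPU-backed inference node that may already hold KV-cache entries
/// for previously seen sessions. (Hypothetical type for illustration.)
struct Node {
    id: &'static str,
}

/// Pick a node by hashing the session identifier, so every turn of the
/// same conversation is routed to the same node and can reuse the prompt
/// (KV) cache built during earlier turns.
fn route_by_session(session_id: &str, nodes: &[Node]) -> usize {
    let mut hasher = DefaultHasher::new();
    session_id.hash(&mut hasher);
    (hasher.finish() as usize) % nodes.len()
}

fn main() {
    let nodes = vec![
        Node { id: "gpu-a" },
        Node { id: "gpu-b" },
        Node { id: "gpu-c" },
    ];

    // Two turns of the same session land on the same node (cache hit);
    // a different session may be routed elsewhere.
    let turn_1 = route_by_session("session-1234", &nodes);
    let turn_2 = route_by_session("session-1234", &nodes);
    let other = route_by_session("session-9999", &nodes);

    assert_eq!(turn_1, turn_2);
    println!(
        "session-1234 -> {}, session-9999 -> {}",
        nodes[turn_1].id, nodes[other].id
    );
}
```

The point of the affinity is simply that a cache hit skips recomputing the prefill for the shared prefix; the cross-GPU KV sharing via the Mooncake Transfer Engine covers the cases where affinity alone is not enough.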